A fundamental requirement for visual surveillance with non-overlapping camera networks is that tracked objects be labeled consistently across cameras: the tracklets captured at different cameras (called observations in this paper) that belong to the same object should be assigned the same label. In this paper, we formulate this task as a Bayesian inference problem and propose a distributed inference framework in which the posterior distribution of the labeling variable for each observation, conditioned on all past appearance and spatio-temporal evidence collected across the whole network, is computed solely from local information processing at each camera and information exchange between neighboring cameras. In our framework, the number of objects present in the monitored region, i.e., the sample space of the labeling variables, does not need to be specified beforehand; instead, it is determined automatically on the fly. In addition, we make no assumption about the appearance distribution of a single object, but use similarity scores between appearance pairs, given by an advanced object re-identification algorithm, as the appearance likelihood for inference. This feature makes our method flexible and competitive when observing conditions change substantially across camera views. To cope with missed detections, which are critical for distributed inference, we consider an enlarged neighborhood of each camera during inference and use a mixture model to describe the higher-order spatio-temporal constraints. This improves the robustness of the algorithm against missed detections at the cost of a slightly increased computation and communication burden at each camera node. Finally, we demonstrate the effectiveness of our method through experiments on an indoor Office Building dataset and an outdoor Campus Garden dataset.
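As a rough illustration of the labeling idea summarized above, the following minimal sketch computes a posterior over labels for one new observation when the label set is open-ended. It is not the algorithm of this paper: the function name `label_posterior`, the simple product of a similarity score and a spatio-temporal prior weight, and the constants `new_prior` and `new_likelihood` are all assumptions made for illustration, and the sketch ignores the distributed message exchange and the mixture model for missed detections.

```python
def label_posterior(similarity, transition_prior, new_prior=0.1, new_likelihood=0.5):
    """Toy posterior over existing labels plus a 'new' label (illustrative only).

    similarity:       dict label -> re-identification similarity score in [0, 1],
                      used directly as the appearance likelihood (assumption).
    transition_prior: dict label -> non-negative spatio-temporal prior weight
                      (hypothetical; e.g. plausibility of the travel time).
    new_prior, new_likelihood: hypothetical constants for a previously unseen object.
    """
    # Unnormalized score for each already-known label: likelihood times prior.
    scores = {
        label: similarity[label] * transition_prior.get(label, 0.0)
        for label in similarity
    }
    # Reserve mass for a new object, so the number of objects is never fixed in advance.
    scores["new"] = new_likelihood * new_prior

    total = sum(scores.values())
    if total <= 0.0:
        raise ValueError("all scores are zero; cannot normalize")
    # Normalize so the values form a probability distribution.
    return {label: value / total for label, value in scores.items()}


posterior = label_posterior(
    similarity={"A": 0.9, "B": 0.2},
    transition_prior={"A": 0.6, "B": 0.3},
)
best_label = max(posterior, key=posterior.get)
```

If `best_label` turns out to be `"new"`, a fresh label would be created and added to the label set, which is one simple way the sample space could grow on the fly.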