This work addresses a problem named video co-localization that aims at localizing the objects in videos jointly. Although there are numerous cues available for this purpose, for example, saliency, motion, and joint, their robust fusion can be quite challenging at times due to their spatial inconsistencies. To overcome this, in this paper, we propose a novel appearance fusion method where we fuse appearance models derived from these cues rather than spatially fusing their maps. In this method, we evaluate the cues in terms of their reliability and consensus to guide the appearance fusion process. We also develop a novel joint cue relying on topological hierarchy. We utilize the final fusion results to produce a few candidate bounding boxes and for subsequent optimal selection among them while considering the spatiotemporal constraints. The proposed method achieves promising results on the YouTube Objects dataset.