Abstract: Contrastive deep clustering has recently gained significant attention for its ability to perform contrastive learning and clustering jointly via deep neural networks. Despite the rapid progress, previous works mostly require both positive and negative sample pairs for contrastive clustering, which relies on a relatively large batch size. Moreover, they typically adopt a two-stream architecture with two augmented views, which overlooks the possibility and potential benefits of multi-stream architectures (especially with heterogeneous or hybrid networks). In light of this, this paper presents a new end-to-end deep clustering approach termed Heterogeneous Tri-stream Clustering Network (HTCN). The tri-stream architecture in HTCN consists of three main components: two weight-sharing online networks and a target network, where the parameters of the target network are the exponential moving average of those of the online networks. Notably, the two online networks are trained by simultaneously (i) predicting the instance representations of the target network and (ii) enforcing consistency between the cluster representations of the target network and those of the two online networks. Experimental results on four challenging image datasets demonstrate the superiority of HTCN over state-of-the-art deep clustering approaches. The code is available at https://github.com/dengxiaozhi/HTCN.
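The exponential-moving-average relation between the target network and the online networks can be illustrated with a minimal sketch. It is not taken from the HTCN code: the function name `ema_update`, the flat list-of-floats parameter representation, and the momentum value are all assumptions made for illustration. Because the two online networks share weights, a single set of online parameters is assumed.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Return new target parameters as an exponential moving average.

    Hypothetical helper: parameters are flat lists of floats, and the
    momentum value is an assumed hyperparameter, not one from the paper.
    """
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    # target <- momentum * target + (1 - momentum) * online
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]


# Toy usage: the target drifts toward the (fixed) online parameters.
target = [0.0, 1.0]
online = [1.0, 3.0]
for _ in range(3):
    target = ema_update(target, online, momentum=0.5)
```

In this sketch the target is never updated by gradients. It only tracks a smoothed copy of the online weights, which is the usual reason for pairing an online network with a momentum-averaged target.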
Abstract: Deep clustering has attracted increasing attention in recent years due to its capability of joint representation learning and clustering via deep neural networks. In its latest developments, contrastive learning has emerged as an effective technique to substantially enhance deep clustering performance. However, existing contrastive learning based deep clustering algorithms mostly focus on carefully designed augmentations (often with limited transformations to preserve the structure), referred to as weak augmentations, and do not go beyond them to explore the opportunities offered by stronger augmentations (with more aggressive transformations or even severe distortions). In this paper, we present an end-to-end deep clustering approach termed strongly augmented contrastive clustering (SACC), which extends the conventional two-augmentation-view paradigm to multiple views and jointly leverages strong and weak augmentations for strengthened deep clustering. In particular, we utilize a backbone network with triply-shared weights, into which a strongly augmented view and two weakly augmented views are fed. Based on the representations produced by the backbone, the weak-weak view pair and the strong-weak view pairs are simultaneously exploited for instance-level contrastive learning (via an instance projector) and cluster-level contrastive learning (via a cluster projector), which, together with the backbone, can be jointly optimized in a purely unsupervised manner. Experimental results on five challenging image datasets show the superior performance of the proposed SACC approach over the state of the art.
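How one weak-weak pair and two strong-weak pairs might be combined at the instance level can be sketched as follows. The abstract does not state the loss, so this sketch assumes an NT-Xent style contrastive loss and an unweighted sum over the three pairs. The function names, the temperature, and the list-of-lists embedding format are hypothetical, and the cluster-level term is omitted.

```python
import math


def _normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(a, b):
    # Dot product; equals cosine similarity for unit-length inputs.
    return sum(x * y for x, y in zip(a, b))


def nt_xent(view_a, view_b, temperature=0.5):
    """Assumed NT-Xent loss; row i of both views comes from the same sample."""
    n = len(view_a)
    z = [_normalize(v) for v in view_a + view_b]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        sims = [_dot(z[i], z[j]) / temperature for j in range(2 * n) if j != i]
        log_denominator = math.log(sum(math.exp(s) for s in sims))
        total += log_denominator - _dot(z[i], z[pos]) / temperature
    return total / (2 * n)


def three_view_instance_loss(weak_1, weak_2, strong, temperature=0.5):
    """Sum of one weak-weak and two strong-weak pair losses (assumed weighting)."""
    return (
        nt_xent(weak_1, weak_2, temperature)
        + nt_xent(strong, weak_1, temperature)
        + nt_xent(strong, weak_2, temperature)
    )


# Toy embeddings for two samples under three views.
weak_1 = [[1.0, 0.0], [0.0, 1.0]]
weak_2 = [[0.9, 0.1], [0.1, 0.9]]
strong = [[0.7, 0.3], [0.2, 0.8]]
loss = three_view_instance_loss(weak_1, weak_2, strong)
```

The loss is lower when matching rows across views point in similar directions, which is what pulls the strongly augmented view toward its weakly augmented counterparts in this sketch.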