Abstract:In partial label learning (PLL), every sample is associated with a candidate label set comprising the ground-truth label and several noisy labels. Conventional PLL assumes that the noisy labels are randomly generated (instance-independent), whereas in practical scenarios the noisy labels are always instance-dependent and highly related to the sample features, leading to the instance-dependent partial label learning (IDPLL) problem. Instance-dependent noisy labels are a double-edged sword. On one side, they may promote model training, as the noisy labels can depict the sample to some extent. On the other side, they bring high label ambiguity, as the noisy labels are hard to distinguish from the ground-truth label. To leverage the nuances of IDPLL effectively, for the first time we create class-wise embeddings for each sample, which allow us to explore the relationship of instance-dependent noisy labels, i.e., the class-wise embeddings in the candidate label set should have high similarity, while the class-wise embeddings between the candidate label set and the non-candidate label set should have high dissimilarity. Moreover, to reduce the high label ambiguity, we introduce the concept of class prototypes containing global feature information to disambiguate the candidate label set. Extensive experimental comparisons with twelve methods on six benchmark data sets, including four fine-grained data sets, demonstrate the effectiveness of the proposed method. The code implementation is publicly available at https://github.com/Yangfc-ML/CEL.
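The similarity relationship described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, and the toy embeddings are all assumptions made for illustration. The sketch only measures the two quantities that the abstract says should be high (similarity within the candidate set) and low (similarity between candidate and non-candidate classes).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def candidate_similarity_terms(class_embeddings, candidates):
    """Hypothetical helper: return (mean similarity within the candidate set,
    mean similarity between candidate and non-candidate classes) for one sample.

    class_embeddings[c] is the sample's embedding for class c (assumed layout).
    """
    cand = sorted(candidates)
    non_cand = [c for c in range(len(class_embeddings)) if c not in candidates]
    within = [cosine(class_embeddings[a], class_embeddings[b])
              for i, a in enumerate(cand) for b in cand[i + 1:]]
    across = [cosine(class_embeddings[a], class_embeddings[b])
              for a in cand for b in non_cand]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(within), mean(across)

# Toy sample with three classes; classes 0 and 1 form the candidate label set.
embeddings = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
within, across = candidate_similarity_terms(embeddings, {0, 1})
```

A training objective in this spirit would push the first value up and the second value down; the exact loss used in the paper is not specified in the abstract.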
Abstract:Symmetric nonnegative matrix factorization (SymNMF) is a powerful tool for clustering, which typically uses the $k$-nearest neighbor ($k$-NN) method to construct the similarity matrix. However, $k$-NN may mislead clustering since the neighbors may belong to different clusters, and its reliability generally decreases as $k$ grows. In this paper, we construct the similarity matrix as a weighted $k$-NN graph with learnable weights that reflect the reliability of each $k$-th NN. This approach reduces the search space of similarity matrix learning to $n - 1$ dimensions, as opposed to the $\mathcal{O}(n^2)$ dimensions of existing methods, where $n$ represents the number of samples. Moreover, to obtain a discriminative similarity matrix, we introduce a dissimilarity matrix with a dual structure of the similarity matrix, and propose a new form of orthogonality regularization with discussions on its geometric interpretation and numerical stability. An efficient alternating optimization algorithm is designed to solve the proposed model, with a theoretical guarantee that the variables converge to a stationary point that satisfies the KKT conditions. The advantage of the proposed model is demonstrated by comparison with nine state-of-the-art clustering methods on eight datasets. The code is available at https://github.com/lwl-learning/LSDGSymNMF.
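To make the weighted $k$-NN construction concrete, the following sketch builds a similarity matrix in which the $k$-th nearest neighbor of every sample receives the same weight, so only $n - 1$ weights are needed. It is an illustration under stated assumptions, not the paper's model: the weights are fixed by hand rather than learned, Euclidean distance is assumed, and the final symmetrization step is a choice made here.

```python
import math

def weighted_knn_graph(points, weights):
    """Build a symmetric similarity matrix from rank-dependent neighbor weights.

    points  : list of coordinate tuples, one per sample
    weights : list of n - 1 values; weights[r] is assigned to the (r + 1)-th
              nearest neighbor of every sample (hand-set here, learned in the paper)
    """
    n = len(points)
    if len(weights) != n - 1:
        raise ValueError("expected exactly n - 1 weights")
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sort all other samples by Euclidean distance to sample i.
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: math.dist(points[i], points[j]))
        for rank, j in enumerate(neighbors):
            s[i][j] = weights[rank]
    # Symmetrize by averaging (an assumption of this sketch).
    return [[0.5 * (s[i][j] + s[j][i]) for j in range(n)] for i in range(n)]

# Four 1-D samples; nearer neighbor ranks are trusted more than farther ones.
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
weights = [1.0, 0.5, 0.0]
S = weighted_knn_graph(points, weights)
```

Because the weight depends only on the neighbor rank, the number of free parameters grows linearly with $n$ instead of quadratically, which is the reduction the abstract refers to.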
Abstract:Ensemble clustering aggregates multiple weak clusterings to achieve a more accurate and robust consensus result. The Co-Association matrix (CA matrix) based method is the mainstream ensemble clustering approach; it constructs the similarity relationships between sample pairs according to the weak clustering partitions to generate the final clustering result. However, the existing methods neglect that the quality of a cluster is related to its size, i.e., a smaller cluster tends to have higher accuracy. Moreover, they do not consider the valuable dissimilarity information in the base clusterings, which can reflect the varying importance of sample pairs that are completely disconnected. To this end, we propose the Similarity and Dissimilarity Guided Co-association matrix (SDGCA) to achieve ensemble clustering. First, we introduce normalized ensemble entropy to estimate the quality of each cluster, and construct a similarity matrix based on this estimation. Then, we employ the random walk to explore high-order proximity of base clusterings to construct a dissimilarity matrix. Finally, the adversarial relationship between the similarity matrix and the dissimilarity matrix is utilized to construct a promoted CA matrix for ensemble clustering. We compared our method with 13 state-of-the-art methods across 12 datasets, and the results demonstrated the superior clustering ability and robustness of the proposed approach. The code is available at https://github.com/xuz2019/SDGCA.
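For readers unfamiliar with the co-association matrix that this abstract builds on, here is a minimal sketch of the plain (unweighted) CA matrix: each entry is the fraction of base clusterings that place two samples in the same cluster. This is the standard baseline construction, not SDGCA itself; the function name and toy partitions are chosen for illustration.

```python
def co_association(partitions):
    """Plain co-association matrix.

    partitions : list of base clusterings, each a list of cluster labels
                 (one label per sample, same sample order in every clustering)
    Returns an n x n matrix whose (i, j) entry is the fraction of base
    clusterings in which samples i and j share a cluster.
    """
    m = len(partitions)
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[counts[i][j] / m for j in range(n)] for i in range(n)]

# Three weak clusterings of four samples.
base_clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
ca = co_association(base_clusterings)
```

In this toy case samples 0 and 1 always co-occur, while samples 0 and 3 never do. Pairs such as (0, 3) are the "completely disconnected" pairs for which the abstract argues that dissimilarity information is still informative.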
Abstract:Spectral variation is a common problem for hyperspectral image (HSI) representation. Low-rank tensor representation is an important approach to alleviate spectral variations. However, the spatial distribution of the HSI is always irregular, while previous tensor low-rank representation methods can only be applied to regular data cubes, which limits their performance. To remedy this issue, in this paper we propose a novel irregular tensor low-rank representation model. We first segment the HSI data into several irregular homogeneous regions. Then, we propose a novel irregular tensor low-rank representation method that can efficiently model the irregular 3D cubes. We further use a non-convex nuclear norm to pursue low-rankness and introduce a negative global low-rank term that improves global consistency. The proposed model is finally formulated as a convex-concave optimization problem and solved by an alternating augmented Lagrangian method. In experiments on four public datasets, the proposed method significantly outperforms the existing low-rank-based HSI methods. Code is available at: https://github.com/hb-studying/ITLRR.
Abstract:Label Distribution Learning (LDL) is a novel machine learning paradigm that addresses the problem of label ambiguity and has found widespread applications. Obtaining complete label distributions in real-world scenarios is challenging, which has led to the emergence of Incomplete Label Distribution Learning (InLDL). However, the existing InLDL methods overlook a crucial aspect of LDL data: the inherent imbalance in label distributions. To address this limitation, we propose Incomplete and Imbalance Label Distribution Learning (I$^2$LDL), a framework that simultaneously handles incomplete labels and imbalanced label distributions. Our method decomposes the label distribution matrix into a low-rank component for frequent labels and a sparse component for rare labels, effectively capturing the structure of both head and tail labels. We optimize the model using the Alternating Direction Method of Multipliers (ADMM) and derive generalization error bounds via Rademacher complexity, providing strong theoretical guarantees. Extensive experiments on 15 real-world datasets demonstrate the effectiveness and robustness of our proposed framework compared to existing InLDL methods.
Abstract:In this paper, we introduce the Dependent Noise-based Inaccurate Label Distribution Learning (DN-ILDL) framework to tackle the challenges posed by noise in label distribution learning, which arises from dependencies on instances and labels. We start by modeling the inaccurate label distribution matrix as a combination of the true label distribution and a noise matrix influenced by specific instances and labels. To address this, we develop a linear mapping from instances to their true label distributions, incorporating label correlations, and decompose the noise matrix using feature and label representations, applying group sparsity constraints to accurately capture the noise. Furthermore, we employ graph regularization to align the topological structures of the input and output spaces, ensuring accurate reconstruction of the true label distribution matrix. Utilizing the Alternating Direction Method of Multipliers (ADMM) for efficient optimization, we validate our method's capability to recover true labels accurately and establish a generalization error bound. Extensive experiments demonstrate that DN-ILDL effectively addresses the ILDL problem and outperforms existing LDL methods.
Abstract:Semi-supervised symmetric non-negative matrix factorization (SNMF) utilizes the available supervisory information (usually in the form of pairwise constraints) to improve the clustering ability of SNMF. Previous methods introduce the pairwise constraints from a local perspective, i.e., they either directly refine the similarity matrix element-wise or restrain the distance of the decomposed vectors in pairs according to the pairwise constraints. They thereby overlook the global perspective, i.e., in the ideal case, the pairwise constraint matrix and the ideal similarity matrix possess the same low-rank structure. To this end, we first propose a novel semi-supervised SNMF model by seeking a low-rank representation for the tensor synthesized by the pairwise constraint matrix and a similarity matrix obtained by the product of the embedding matrix and its transpose, which could strengthen those two matrices simultaneously from a global perspective. We then propose an enhanced SNMF model, making the embedding matrix tailored to the above tensor low-rank representation. We finally refine the similarity matrix by the strengthened pairwise constraints. We repeat the above steps to continuously boost the similarity matrix and pairwise constraint matrix, leading to a high-quality embedding matrix. Extensive experiments substantiate the superiority of our method. The code is available at https://github.com/JinaLeejnl/TSNMF.
Abstract:Deep clustering has exhibited remarkable performance; however, the overconfidence problem, i.e., the estimated confidence for a sample belonging to a particular cluster greatly exceeds its actual prediction accuracy, has been overlooked in prior research. To tackle this critical issue, we pioneer the development of a calibrated deep clustering framework. Specifically, we propose a novel dual-head deep clustering pipeline that can effectively calibrate the estimated confidence and the actual accuracy. The calibration head adjusts the overconfident predictions of the clustering head using regularization methods, generating prediction confidence and pseudo-labels that match the model learning status. This calibration process also guides the clustering head in dynamically selecting reliable high-confidence samples for training. Additionally, we introduce an effective network initialization strategy that enhances both training speed and network robustness. Extensive experiments demonstrate that the proposed calibrated deep clustering framework not only improves on state-of-the-art deep clustering methods by a factor of approximately 10 in terms of expected calibration error but also significantly outperforms them in terms of clustering accuracy.
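The expected calibration error (ECE) mentioned in this abstract measures the gap between confidence and accuracy that defines overconfidence. The sketch below uses the common equal-width binning formulation; the bin count, function name, and toy inputs are assumptions, and the paper's exact evaluation protocol is not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences : predicted confidence of each sample, in [0, 1]
    correct     : 1 if the prediction was right, 0 otherwise
    Returns the sample-weighted average of |mean confidence - accuracy| per bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: 95% confidence but only half the predictions correct.
overconfident = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
# A calibrated model: 50% confidence and half the predictions correct.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
```

The overconfident example yields an ECE of about 0.45, while the calibrated example yields 0. In clustering, the `correct` flags would come from matching cluster assignments to ground-truth classes, a step this sketch leaves out.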
Abstract:Hyperspectral image (HSI) clustering is an important but challenging task. The state-of-the-art (SOTA) methods usually rely on superpixels; however, they do not fully utilize the spatial and spectral information in the HSI 3-D structure, and their optimization targets are not clustering-oriented. In this work, we first use 3-D and 2-D hybrid convolutional neural networks to extract the high-order spatial and spectral features of HSI through pre-training, and then design a superpixel graph contrastive clustering (SPGCC) model to learn discriminative superpixel representations. Reasonable augmented views are crucial for contrastive clustering, and conventional contrastive learning may hurt the cluster structure since different samples are pushed away in the embedding space even if they belong to the same class. In SPGCC, we design two semantic-invariant data augmentations for HSI superpixels: pixel sampling augmentation and model weight augmentation. Then sample-level alignment and clustering-center-level contrast are performed for better intra-class similarity and inter-class dissimilarity of superpixel embeddings. We perform clustering and network optimization alternately. Experimental results on several HSI datasets verify the advantages of the proposed method, e.g., on Indian Pines, our model improves the clustering accuracy from 58.79% to 67.59% compared to the SOTA method.
Abstract:This paper introduces RankMatch, an innovative approach for Semi-Supervised Label Distribution Learning (SSLDL). Addressing the challenge of limited labeled data, RankMatch effectively utilizes a small number of labeled examples in conjunction with a larger quantity of unlabeled data, reducing the need for extensive manual labeling in Deep Neural Network (DNN) applications. Specifically, RankMatch introduces an ensemble learning-inspired averaging strategy that creates a pseudo-label distribution from multiple weakly augmented images. This not only stabilizes predictions but also enhances the model's robustness. Beyond this, RankMatch integrates a pairwise relevance ranking (PRR) loss, capturing the complex inter-label correlations and ensuring that the predicted label distributions align with the ground truth. We establish a theoretical generalization bound for RankMatch, and through extensive experiments, demonstrate its superiority in performance against existing SSLDL methods.
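The two ingredients named in this abstract, averaging predictions over several weakly augmented views and a pairwise ranking loss, can be sketched as follows. This is an illustration only: the function names are hypothetical, and the hinge form of the ranking loss is an assumption, since the abstract does not give the actual PRR loss.

```python
def average_pseudo_label(view_predictions):
    """Average the predicted label distributions of several augmented views.

    view_predictions : list of distributions (lists of floats summing to 1),
                       one per weakly augmented copy of the same image
    """
    k = len(view_predictions)
    n_labels = len(view_predictions[0])
    return [sum(p[j] for p in view_predictions) / k for j in range(n_labels)]

def pairwise_ranking_loss(pred, target, margin=0.0):
    """Hinge-style ranking penalty (assumed form, not the paper's PRR loss).

    For every label pair (a, b) where the target ranks a above b, penalize
    the prediction if pred[a] does not exceed pred[b] by at least `margin`.
    """
    loss = 0.0
    n_labels = len(pred)
    for a in range(n_labels):
        for b in range(n_labels):
            if target[a] > target[b]:
                loss += max(0.0, margin - (pred[a] - pred[b]))
    return loss

# Pseudo-label distribution from two weakly augmented views of one image.
pseudo = average_pseudo_label([[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]])

# A prediction that preserves the target ranking incurs no penalty;
# one that swaps the top two labels does.
agree = pairwise_ranking_loss([0.5, 0.3, 0.2], pseudo)
swapped = pairwise_ranking_loss([0.2, 0.5, 0.3], pseudo)
```

Averaging convex combinations keeps the pseudo-label a valid distribution, and the ranking term depends only on the relative order of label degrees rather than on their exact values.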