Abstract:The challenges in applying contrastive learning to speaker verification (SV) are that the softmax-based contrastive loss lacks discriminative power and that the hard negative pairs can easily influence learning. To overcome these challenges, we propose a contrastive learning SV framework incorporating an additive angular margin into the supervised contrastive loss. The margin improves the speaker representation's discrimination ability. We introduce a class-aware attention mechanism through which hard negative samples contribute less significantly to the supervised contrastive loss. We also employed a gradient-based multi-objective optimization approach to balance the classification and contrastive loss. Experimental results on CN-Celeb and Voxceleb1 show that this new learning objective can cause the encoder to find an embedding space that exhibits great speaker discrimination across languages.