Abstract: This study analyzes how the attention mechanisms of large language models (LLMs) change when the models are used to understand natural human-human conversations. For comparison, we examine three common LLM use cases: interactions over web content, code, and mathematical texts. By analyzing attention distance, dispersion, and interdependency across these domains, we highlight the unique challenges posed by conversational data. Notably, conversations require nuanced handling of long-range contextual relationships and exhibit greater complexity in their attention patterns. Our findings reveal that while language models exhibit domain-specific attention behaviors, there is a significant gap in their ability to specialize in human conversations. Through detailed attention-entropy analysis and t-SNE visualizations, we demonstrate the need for models trained on a diverse array of high-quality conversational data to enhance the understanding and generation of human-like dialogue. This research underscores the importance of domain specialization in language models and suggests pathways for future advances in modeling the nuances of human conversation.
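The abstract does not give the paper's exact definition of attention entropy, but the standard formulation is the Shannon entropy of each attention row (one softmax distribution per query token). A minimal sketch under that assumption:

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy of each attention row.

    attn: array of shape (..., n) where each row is a softmax
    distribution over n key tokens. Higher entropy = more
    dispersed attention; lower entropy = more focused attention.
    """
    attn = np.asarray(attn, dtype=float)
    eps = 1e-12  # guard against log(0) for zero-weight entries
    return -np.sum(attn * np.log(attn + eps), axis=-1)

# Uniform attention over n tokens maximizes entropy at log(n),
# while a one-hot (fully focused) row has entropy near zero.
n = 8
uniform = np.full((1, n), 1.0 / n)
focused = np.eye(1, n)
print(attention_entropy(uniform))  # close to log(8) ~ 2.079
print(attention_entropy(focused))  # close to 0
```

Comparing such per-row entropies across domains (web, code, math, conversation) is one way the dispersion differences described above can be quantified.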
Abstract: In recent years, contrastive-learning-based loss functions have become increasingly popular for visual self-supervised representation learning owing to their state-of-the-art (SOTA) performance. Most modern contrastive learning losses, such as the one used in SimCLR, are InfoNCE-based and generalize only to one positive and multiple negatives per anchor. The recent state-of-the-art supervised contrastive (SupCon) loss extends self-supervised contrastive learning to the supervised setting by generalizing to multiple positives and multiple negatives in a batch, and improves upon the cross-entropy loss. In this paper, we propose a novel contrastive loss function, the Tuned Contrastive Learning (TCL) loss, which generalizes to multiple positives and multiple negatives within a batch and offers parameters to tune and improve the gradient responses from hard positives and hard negatives. We provide a theoretical analysis of our loss function's gradient response and show mathematically how it improves on that of the SupCon loss. Empirically, we compare our loss function with the SupCon loss and cross-entropy loss in a supervised setting on multiple classification datasets. We also show the stability of our loss function across various hyperparameter settings. Finally, we compare TCL with various SOTA self-supervised learning methods and show that it achieves performance on par with them in both supervised and self-supervised settings.
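The abstract does not spell out TCL's tuning parameters, but the SupCon baseline it extends is published and well known: each anchor is pulled toward all same-class samples in the batch and pushed from the rest. A minimal numpy sketch of that baseline (not the proposed TCL loss):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive (SupCon) loss over a batch.

    z: (N, d) embeddings (L2-normalized internally).
    labels: (N,) integer class labels; same-label pairs are positives.
    tau: temperature scaling the cosine similarities.
    """
    z = np.asarray(z, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    labels = np.asarray(labels)
    N = len(labels)
    sim = z @ z.T / tau                            # pairwise similarities
    not_self = ~np.eye(N, dtype=bool)              # exclude each anchor itself
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits) * not_self
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & not_self
    # mean log-probability over positives, averaged over anchors
    # that have at least one positive in the batch
    per_anchor = (log_prob * pos).sum(1) / np.maximum(pos.sum(1), 1)
    return -(per_anchor[pos.sum(1) > 0]).mean()

# Clustered same-class embeddings yield a small loss; positives that
# are dissimilar while negatives are similar yield a large loss.
good = supcon_loss([[1, 0], [1, 0], [0, 1], [0, 1]], [0, 0, 1, 1])
bad = supcon_loss([[1, 0], [0, 1], [1, 0], [0, 1]], [0, 0, 1, 1])
```

TCL, as described above, modifies the gradient response of this objective for hard positives and negatives via additional tunable parameters.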