Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chaochen Gao

LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

May 22, 2025

Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu

Abstract:High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.

Via

Access Paper or Ask Questions

Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model

May 30, 2024

Chaochen Gao, Xing Wu, Qi Fu, Songlin Hu

Abstract:Large language models, initially pre-trained with a limited context length, can better handle longer texts by continuing training on a corpus with extended contexts. However, obtaining effective long-context data is challenging due to the scarcity and uneven distribution of long documents across different domains. To address this issue, we propose a Query-centric data synthesis method, abbreviated as Quest. Quest is an interpretable method based on the observation that documents retrieved by similar queries are relevant but low-redundant, thus well-suited for synthesizing long-context data. The method is also scalable and capable of constructing large amounts of long-context data. Using Quest, we synthesize a long-context dataset up to 128k context length, significantly outperforming other data synthesis methods on multiple long-context benchmark datasets. In addition, we further verify that the Quest method is predictable through scaling law experiments, making it a reliable solution for advancing long-context models.

Via

Access Paper or Ask Questions

RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval

Oct 13, 2022

Xing Wu, Chaochen Gao, Zijia Lin, Zhongyuan Wang, Jizhong Han, Songlin Hu

Figure 1 for RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval

Figure 2 for RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval

Figure 3 for RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval

Figure 4 for RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval

Abstract:Video language pre-training methods have mainly adopted sparse sampling techniques to alleviate the temporal redundancy of videos. Though effective, sparse sampling still suffers inter-modal redundancy: visual redundancy and textual redundancy. Compared with highly generalized text, sparsely sampled frames usually contain text-independent portions, called visual redundancy. Sparse sampling is also likely to miss important frames corresponding to some text portions, resulting in textual redundancy. Inter-modal redundancy leads to a mismatch of video and text information, hindering the model from better learning the shared semantics across modalities. To alleviate it, we propose Redundancy-aware Video-language Pre-training. We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dis-similarity. Then, we penalize the highredundant video patches and text tokens through a proposed redundancy-aware contrastive learning. We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC, achieving a significant improvement over the previous stateof-the-art results. Our code are available at https://github.com/caskcsg/VLP/tree/main/RaP.

* EMNLP 2022

Via

Access Paper or Ask Questions

InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings

Oct 08, 2022

Xing Wu, Chaochen Gao, Zijia Lin, Jizhong Han, Zhongyuan Wang, Songlin Hu

Figure 1 for InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings

Figure 2 for InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings

Figure 3 for InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings

Figure 4 for InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings

Abstract:Contrastive learning has been extensively studied in sentence embedding learning, which assumes that the embeddings of different views of the same sentence are closer. The constraint brought by this assumption is weak, and a good sentence representation should also be able to reconstruct the original sentence fragments. Therefore, this paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE. InfoCSE forces the representation of [CLS] positions to aggregate denser sentence information by introducing an additional Masked language model task and a well-designed network. We evaluate the proposed InfoCSE on several benchmark datasets w.r.t the semantic text similarity (STS) task. Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base, and 1.77% on BERT-large, achieving state-of-the-art results among unsupervised sentence representation learning methods. Our code are available at https://github.com/caskcsg/sentemb/info

* EMNLP 2022

Via

Access Paper or Ask Questions

Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks

Feb 28, 2022

Xing Wu, Chaochen Gao, Meng Lin, Liangjun Zang, Zhongyuan Wang, Songlin Hu

Figure 1 for Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks

Figure 2 for Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks

Figure 3 for Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks

Figure 4 for Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks

Abstract:Before entering the neural network, a token is generally converted to the corresponding one-hot representation, which is a discrete distribution of the vocabulary. Smoothed representation is the probability of candidate tokens obtained from a pre-trained masked language model, which can be seen as a more informative substitution to the one-hot representation. We propose an efficient data augmentation method, termed text smoothing, by converting a sentence from its one-hot representation to a controllable smoothed representation. We evaluate text smoothing on different benchmarks in a low-resource regime. Experimental results show that text smoothing outperforms various mainstream data augmentation methods by a substantial margin. Moreover, text smoothing can be combined with those data augmentation methods to achieve better performance.

* ACL 2022 Main Conference Accepted

Via

Access Paper or Ask Questions

DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings

Dec 10, 2021

Xing Wu, Chaochen Gao, Jue Wang, Liangjun Zang, Zhongyuan Wang, Songlin Hu

Figure 1 for DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings

Figure 2 for DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings

Figure 3 for DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings

Figure 4 for DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings

Abstract:Contrastive learning has been proven suitable for learning sentence embeddings and can significantly improve the semantic textual similarity (STS) tasks. Recently, large contrastive learning models, e.g., Sentence-T5, tend to be proposed to learn more powerful sentence embeddings. Though effective, such large models are hard to serve online due to computational resources or time cost limits. To tackle that, knowledge distillation (KD) is commonly adopted, which can compress a large "teacher" model into a small "student" model but generally suffer from some performance loss. Here we propose an enhanced KD framework termed Distill-Contrast (DisCo). The proposed DisCo framework firstly utilizes KD to transfer the capability of a large sentence embedding model to a small student model on large unlabelled data, and then finetunes the student model with contrastive learning on labelled training data. For the KD process in DisCo, we further propose Contrastive Knowledge Distillation (CKD) to enhance the consistencies among teacher model training, KD, and student model finetuning, which can probably improve performance like prompt learning. Extensive experiments on 7 STS benchmarks show that student models trained with the proposed DisCo and CKD suffer from little or even no performance loss and consistently outperform the corresponding counterparts of the same parameter size. Amazingly, our 110M student model can even outperform the latest state-of-the-art (SOTA) model, i.e., Sentence-T5(11B), with only 1% parameters.

* 9 pages, 3 figures

Via

Access Paper or Ask Questions

TransAug: Translate as Augmentation for Sentence Embeddings

Oct 30, 2021

Jue Wang, Haofan Wang, Xing Wu, Chaochen Gao, Debing Zhang

Figure 1 for TransAug: Translate as Augmentation for Sentence Embeddings

Figure 2 for TransAug: Translate as Augmentation for Sentence Embeddings

Figure 3 for TransAug: Translate as Augmentation for Sentence Embeddings

Figure 4 for TransAug: Translate as Augmentation for Sentence Embeddings

Abstract:While contrastive learning greatly advances the representation of sentence embeddings, it is still limited by the size of the existing sentence datasets. In this paper, we present TransAug (Translate as Augmentation), which provide the first exploration of utilizing translated sentence pairs as data augmentation for text, and introduce a two-stage paradigm to advances the state-of-the-art sentence embeddings. Instead of adopting an encoder trained in other languages setting, we first distill a Chinese encoder from a SimCSE encoder (pretrained in English), so that their embeddings are close in semantic space, which can be regraded as implicit data augmentation. Then, we only update the English encoder via cross-lingual contrastive learning and frozen the distilled Chinese encoder. Our approach achieves a new state-of-art on standard semantic textual similarity (STS), outperforming both SimCSE and Sentence-T5, and the best performance in corresponding tracks on transfer tasks evaluated by SentEval.

Via

Access Paper or Ask Questions

ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding

Sep 09, 2021

Xing Wu, Chaochen Gao, Liangjun Zang, Jizhong Han, Zhongyuan Wang, Songlin Hu

Figure 1 for ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding

Figure 2 for ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding

Figure 3 for ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding

Figure 4 for ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding

Abstract:Contrastive learning has been attracting much attention for learning unsupervised sentence embeddings. The current state-of-the-art unsupervised method is the unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE takes dropout as a minimal data augmentation method, and passes the same input sentence to a pre-trained Transformer encoder (with dropout turned on) twice to obtain the two corresponding embeddings to build a positive pair. As the length information of a sentence will generally be encoded into the sentence embeddings due to the usage of position embedding in Transformer, each positive pair in unsup-SimCSE actually contains the same length information. And thus unsup-SimCSE trained with these positive pairs is probably biased, which would tend to consider that sentences of the same or similar length are more similar in semantics. Through statistical observations, we find that unsup-SimCSE does have such a problem. To alleviate it, we apply a simple repetition operation to modify the input sentence, and then pass the input sentence and its modified counterpart to the pre-trained Transformer encoder, respectively, to get the positive pair. Additionally, we draw inspiration from the community of computer vision and introduce a momentum contrast, enlarging the number of negative pairs without additional calculations. The proposed two modifications are applied on positive and negative pairs separately, and build a new sentence embedding method, termed Enhanced Unsup-SimCSE (ESimCSE). We evaluate the proposed ESimCSE on several benchmark datasets w.r.t the semantic text similarity (STS) task. Experimental results show that ESimCSE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 2.02% on BERT-base.

* 9 pages, 2 figures

Via

Access Paper or Ask Questions

Smoothed Contrastive Learning for Unsupervised Sentence Embedding

Sep 09, 2021

Xing Wu, Chaochen Gao, Liangjun Zang, Jizhong Han, Zhongyuan Wang, Songlin Hu

Figure 1 for Smoothed Contrastive Learning for Unsupervised Sentence Embedding

Figure 2 for Smoothed Contrastive Learning for Unsupervised Sentence Embedding

Figure 3 for Smoothed Contrastive Learning for Unsupervised Sentence Embedding

Figure 4 for Smoothed Contrastive Learning for Unsupervised Sentence Embedding

Abstract:Contrastive learning has been gradually applied to learn high-quality unsupervised sentence embedding. Among the previous un-supervised methods, the latest state-of-the-art method, as far as we know, is unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE uses the InfoNCE1loss function in the training stage by pulling semantically similar sentences together and pushing apart dis-similar ones.Theoretically, we expect to use larger batches in unsup-SimCSE to get more adequate comparisons among samples and avoid overfitting. However, increasing the batch size does not always lead to improvements, but instead even lead to performance degradation when the batch size exceeds a threshold. Through statistical observation, we find that this is probably due to the introduction of low-confidence negative pairs after in-creasing the batch size. To alleviate this problem, we introduce a simple smoothing strategy upon the InfoNCE loss function, termedGaussian Smoothing InfoNCE (GS-InfoNCE).Specifically, we add random Gaussian noise vectors as negative samples, which act asa smoothing of the negative sample space.Though being simple, the proposed smooth-ing strategy brings substantial improvements to unsup-SimCSE. We evaluate GS-InfoNCEon the standard semantic text similarity (STS)task. GS-InfoNCE outperforms the state-of-the-art unsup-SimCSE by an average Spear-man correlation of 1.38%, 0.72%, 1.17% and0.28% on the base of BERT-base, BERT-large,RoBERTa-base and RoBERTa-large, respectively.

* 6 pages, 2 figures

Via

Access Paper or Ask Questions