Jungang Xu

JTCSE: Joint Tensor-Modulus Constraints and Cross-Attention for Unsupervised Contrastive Learning of Sentence Embeddings

May 05, 2025

Tianyu Zong, Hongzhu Yi, Bingkang Shi, Yuanxiang Wang, Jungang Xu

Abstract: Unsupervised contrastive learning has become a hot research topic in natural language processing. Existing works usually constrain only the orientation distribution of positive and negative sample representations in the high-dimensional semantic space. However, a semantic representation tensor has both modulus and orientation features; by ignoring the modulus feature, existing works leave contrastive learning insufficient. Therefore, we first propose a training objective that imposes modulus constraints on the semantic representation tensor to strengthen the alignment between positive samples in contrastive learning. Second, BERT-like models suffer from the phenomenon of sinking attention, leading to a lack of attention to the CLS token that aggregates semantic information. In response, we propose a cross-attention structure among the twin-tower ensemble models to enhance the model's attention to the CLS token and improve the quality of CLS pooling. Combining these two motivations, we propose JTCSE, a new Joint Tensor representation modulus constraint and Cross-attention unsupervised contrastive learning Sentence Embedding representation framework. We evaluate it on seven semantic text similarity tasks, and the experimental results show that JTCSE's twin-tower ensemble model and single-tower distillation model outperform the other baselines and become the current SOTA. In addition, we have conducted an extensive zero-shot downstream task evaluation, which shows that JTCSE outperforms other baselines overall on more than 130 tasks.

TNCSE: Tensor's Norm Constraints for Unsupervised Contrastive Learning of Sentence Embeddings

Mar 17, 2025

Tianyu Zong, Bingkang Shi, Hongzhu Yi, Jungang Xu

Abstract: Unsupervised sentence embedding representation has become a hot research topic in natural language processing. As a tensor, a sentence embedding has two critical properties: direction and norm. Existing works have been limited to constraining only the orientation of the samples' representations while ignoring their norms. To address this issue, we propose a new training objective that improves unsupervised contrastive learning by constraining the norm features between positive samples. We combine the training objective of Tensor's Norm Constraints with ensemble learning to propose a new Sentence Embedding representation framework, TNCSE. We evaluate on seven semantic text similarity tasks, and the results show that TNCSE and its derived models are the current state-of-the-art approach; in addition, we conduct extensive zero-shot evaluations, and the results show that TNCSE outperforms other baselines.
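
The distinction between direction and norm drawn in this abstract can be illustrated with a small sketch. It is not the TNCSE training objective, which the abstract does not spell out; `norm_gap` is a hypothetical penalty introduced here only to show what a purely direction-based measure such as cosine similarity cannot see.

```python
import math


def norm(v):
    """Euclidean norm (modulus) of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(u, v):
    """Compares direction only; rescaling either vector leaves it unchanged."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))


def norm_gap(u, v):
    """Hypothetical penalty (an assumption, not the paper's loss):
    absolute difference between the two norms."""
    return abs(norm(u) - norm(v))


# Two toy "positive sample" embeddings with the same direction
# but different norms (5 and 10).
anchor = [3.0, 4.0]
positive = [6.0, 8.0]

print(cosine_similarity(anchor, positive))  # direction-only view
print(norm_gap(anchor, positive))           # norm-only view
```

In this toy case the cosine similarity reports a perfect match while the norm gap is nonzero, which is the kind of mismatch a norm constraint between positive samples would target.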

Text Data-Centric Image Captioning with Interactive Prompts

Mar 28, 2024

Yiyu Wang, Hao Luo, Jungang Xu, Yingfei Sun, Fan Wang

Abstract: Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision and language models (e.g., CLIP) and large-scale generative language models (e.g., GPT-2) have shown strong performance in various tasks, which also provides new solutions for image captioning with web paired data, unpaired data, or even text-only data. Among them, the mainstream solution is to project image embeddings into the text embedding space with the assistance of consistent representations between image-text pairs from the CLIP model. However, current methods still face several challenges in adapting to the diversity of data configurations in a unified solution, accurately estimating image-text embedding bias, and correcting unsatisfactory prediction results in the inference stage. This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap. 1) We consider four different settings that gradually reduce the dependence on paired data. 2) We construct a mapping module driven by a multivariate Gaussian distribution to mitigate the modality gap, which is applicable to the above four settings. 3) We propose a prompt interaction module that can incorporate optional prompt information before generating captions. Extensive experiments show that TIPCap outperforms other weakly supervised or unsupervised image captioning methods and achieves a new state-of-the-art performance on two widely used datasets, i.e., MS-COCO and Flickr30K.

End-to-End Transformer Based Model for Image Captioning

Mar 29, 2022

Yiyu Wang, Jungang Xu, Yingfei Sun

Abstract: CNN-LSTM based architectures have played an important role in image captioning, but because of their limited training efficiency and expressive ability, researchers began to explore CNN-Transformer based models and achieved great success. Meanwhile, almost all recent works adopt Faster R-CNN as the backbone encoder to extract region-level features from given images. However, Faster R-CNN needs pre-training on an additional dataset, which divides the image captioning task into two stages and limits its potential applications. In this paper, we build a pure Transformer-based model, which integrates image captioning into one stage and realizes end-to-end training. First, we adopt Swin Transformer to replace Faster R-CNN as the backbone encoder to extract grid-level features from given images; then, following the Transformer, we build a refining encoder and a decoder. The refining encoder refines the grid features by capturing the intra-relationship between them, and the decoder decodes the refined features into captions word by word. Furthermore, to increase the interaction between multi-modal (vision and language) features and enhance the modeling capability, we calculate the mean pooling of the grid features as the global feature, introduce it into the refining encoder to be refined together with the grid features, and add a pre-fusion process of the refined global feature and the generated words in the decoder. To validate the effectiveness of our proposed model, we conduct experiments on the MSCOCO dataset. The experimental results compared to existing published works demonstrate that our model achieves new state-of-the-art performances of 138.2% (single model) and 141.0% (ensemble of 4 models) CIDEr scores on the 'Karpathy' offline test split and 136.0% (c5) and 138.3% (c40) CIDEr scores on the official online test server. Trained models and source code will be released.

* AAAI 2022
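
The global feature mentioned in the abstract above, the mean pooling of the grid features, can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the authors' implementation; the function name `mean_pool` and the choice to prepend the global feature to the grid features are assumptions made here for clarity.

```python
def mean_pool(grid_features):
    """Average a list of equal-length feature vectors into one vector."""
    if not grid_features:
        raise ValueError("need at least one grid feature")
    count = len(grid_features)
    dim = len(grid_features[0])
    return [sum(vec[i] for vec in grid_features) / count for i in range(dim)]


# Toy 2x2 grid flattened to four 2-dimensional feature vectors.
grid = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

global_feature = mean_pool(grid)

# Assumption: the global feature is placed alongside the grid features
# so that a refining encoder could process them together.
refiner_input = [global_feature] + grid
print(global_feature, len(refiner_input))
```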

A Survey on Neural Network Language Models

Jun 13, 2019

Kun Jing, Jungang Xu

Abstract: As a core component of Natural Language Processing (NLP) systems, a Language Model (LM) provides word representations and probability estimates for word sequences. Neural Network Language Models (NNLMs) overcome the curse of dimensionality and improve on the performance of traditional LMs. This paper presents a survey of NNLMs. The structure of classic NNLMs is described first, and then some major improvements are introduced and analyzed. We summarize and compare corpora and toolkits for NNLMs. Finally, some research directions for NNLMs are discussed.

A Survey on Neural Machine Reading Comprehension

Jun 10, 2019

Boyu Qiu, Xu Chen, Jungang Xu, Yingfei Sun

Abstract: Enabling a machine to read and comprehend natural language documents so that it can answer questions remains an elusive challenge. In recent years, the popularity of deep learning and the establishment of large-scale datasets have both promoted the prosperity of Machine Reading Comprehension. This paper presents how to use neural networks to build a Reader, introduces some classic models, and analyzes the improvements they make. We also point out the defects of existing models and future research directions.

Image Captioning based on Deep Learning Methods: A Survey

May 20, 2019

Yiyu Wang, Jungang Xu, Yingfei Sun, Ben He

Abstract: Image captioning is a challenging task that is attracting more and more attention in the field of Artificial Intelligence, and it can be applied to efficient image retrieval, intelligent guidance for blind people, human-computer interaction, and more. In this paper, we present a survey of advances in image captioning based on Deep Learning methods, including the Encoder-Decoder structure, improved methods in the Encoder, improved methods in the Decoder, and other improvements. Furthermore, we discuss future research directions.

NPRF: A Neural Pseudo Relevance Feedback Framework for Ad-hoc Information Retrieval

Oct 30, 2018

Canjia Li, Yingfei Sun, Ben He, Le Wang, Kai Hui, Andrew Yates, Le Sun, Jungang Xu

Abstract: Pseudo-relevance feedback (PRF) is commonly used to boost the performance of traditional information retrieval (IR) models by using top-ranked documents to identify and weight new query terms, thereby reducing the effect of query-document vocabulary mismatches. While neural retrieval models have recently demonstrated strong results for ad-hoc retrieval, combining them with PRF is not straightforward due to incompatibilities between existing PRF approaches and neural architectures. To bridge this gap, we propose an end-to-end neural PRF framework that can be used with existing neural IR models by embedding different neural models as building blocks. Extensive experiments on two standard test collections confirm the effectiveness of the proposed NPRF framework in improving the performance of two state-of-the-art neural IR models.

* Full paper in EMNLP 2018
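
The traditional, term-based form of pseudo-relevance feedback that the abstract above contrasts with neural models can be sketched as follows. This is a simplified illustration of classic query expansion, not the NPRF framework itself; the function `expand_query`, its parameters, and the frequency-based weighting are assumptions chosen for brevity.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, num_new_terms=2):
    """Treat the top_k ranked documents as relevant and add their most
    frequent non-query terms to the query, weighted by relative frequency."""
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        counts.update(doc.lower().split())
    # Original query terms are not candidates for expansion.
    for term in query_terms:
        counts.pop(term, None)
    total = sum(counts.values())
    weights = {term: 1.0 for term in query_terms}
    for term, count in counts.most_common(num_new_terms):
        weights[term] = count / total
    return weights


# Toy ranking returned by some first-pass retrieval model (hypothetical data).
ranked_docs = [
    "neural ranking models",
    "neural retrieval models",
    "cooking pasta",
]
weights = expand_query(["retrieval"], ranked_docs)
print(weights)
```

Only the top two documents feed the expansion, so the unrelated third document contributes no terms; the expanded query could then be issued for a second retrieval pass.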
