Title: LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval

Aug 31, 2022

Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang

Abstract: In large-scale retrieval, the lexicon-weighting paradigm, which learns weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although it deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval: the former prefers certain or low-entropy words, whereas the latter favors pivot or high-entropy words. This gap is the main barrier to lexicon-weighting performance in large-scale retrieval. To bridge it, we propose a new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Specifically, we place a lexicon-bottlenecked module between a normal language modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to lexicon-weighting retrieval via fine-tuning, achieving 42.6% MRR@10 with 45.83 QPS on a CPU machine for the passage retrieval benchmark MS-Marco. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12 datasets.
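
To make the continuous bag-of-words bottleneck more concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes the lexicon-importance distribution is a plain softmax over per-word vocabulary logits, and that the bottleneck vector is the importance-weighted average of word embeddings. The function names, the toy vocabulary, the logits, and the 2-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
import math


def lexicon_importance(logits):
    """Turn vocabulary logits into a probability distribution (softmax).

    Assumption: a plain softmax; the paper may normalize differently.
    """
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_bottleneck(importance, embeddings):
    """Importance-weighted average of word embeddings (continuous bag of words)."""
    dim = len(embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(importance, embeddings))
        for d in range(dim)
    ]


# Hypothetical toy vocabulary with made-up logits and embeddings.
vocab = ["the", "retrieval", "sparse", "of"]
logits = [0.1, 2.0, 1.5, 0.0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]

importance = lexicon_importance(logits)
bottleneck = cbow_bottleneck(importance, embeddings)

for word, weight in zip(vocab, importance):
    print(f"{word}: {weight:.3f}")
print("bottleneck vector:", bottleneck)
```

In this toy setup the content words ("retrieval", "sparse") receive most of the probability mass, so the bottleneck vector is dominated by their embeddings. Under the assumptions above, a weakened decoder that sees only this vector would have to rely on the importance distribution to reconstruct the input, which is the intuition behind the bottleneck described in the abstract.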

* Work in progress

View paper on OpenReview
