Pierre-Emmanuel Mazaré

FAIR

Inference-time sparse attention with asymmetric indexing

Feb 12, 2025

Pierre-Emmanuel Mazaré, Gergely Szilvasy, Maria Lomeli, Francisco Massa, Naila Murray, Hervé Jégou, Matthijs Douze

Abstract:Self-attention in transformer models is an incremental associative memory that maps key vectors to value vectors. One way to speed up self-attention is to employ GPU-compliant vector search algorithms, yet the standard partitioning methods yield poor results in this context, because of (1) the different distributions followed by keys and queries and (2) the effect of RoPE positional encoding. In this paper, we introduce SAAP (Self-Attention with Asymmetric Partitions), which overcomes these problems. It is an asymmetrical indexing technique that employs distinct partitions for keys and queries, thereby approximating self-attention with a data-adaptive sparsity pattern. It works on pretrained language models without finetuning, as it only requires training a small query classifier offline. On a long-context Llama 3.1-8b model, with sequences ranging from 100k to 500k tokens, our method typically reduces the fraction of memory that needs to be looked up by a factor of 20, which translates to a time saving of 60% when compared to FlashAttention-v2.
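The general idea of attending only to a partition-selected subset of keys can be sketched in a few lines of Python. This is a toy illustration, not the SAAP method itself: the bucket assignments for keys and the buckets chosen for the query are hand-picked here (in the paper the query side is predicted by a trained classifier), and all function names and data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, indices):
    """Softmax attention restricted to the key/value pairs listed in `indices`.

    Assumes `indices` is non-empty.
    """
    weights = softmax([dot(query, keys[i]) for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]

def sparse_attend(query, keys, values, key_bucket, query_buckets):
    """Attend only to keys whose bucket is among those selected for the query.

    Returns the output vector and the fraction of keys that were visited.
    """
    selected = [i for i, b in enumerate(key_bucket) if b in query_buckets]
    return attend(query, keys, values, selected), len(selected) / len(keys)

# Toy data: two buckets of keys, assigned by hand (an assumption of this sketch).
keys = [(4.0, 0.0), (3.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
values = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
key_bucket = [0, 0, 1, 1]
query = (2.0, 0.0)

full_out = attend(query, keys, values, list(range(len(keys))))
sparse_out, fraction = sparse_attend(query, keys, values, key_bucket, query_buckets={0})
print(full_out, sparse_out, fraction)
```

In this toy case the keys in the skipped bucket receive almost no attention weight, so visiting half of the memory changes the output only slightly.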

Vector search with small radiuses

Mar 16, 2024

Gergely Szilvasy, Pierre-Emmanuel Mazaré, Matthijs Douze

Abstract:In recent years, the dominant accuracy metric for vector search has been the recall of a fixed-size result list (top-k retrieval), taking the exact vector retrieval results as ground truth. Although convenient to compute, this metric is only distantly related to the end-to-end accuracy of a full system that integrates vector search. In this paper we focus on the common case where a hard decision needs to be taken depending on the vector retrieval results, for example, deciding whether a query image matches a database image or not. We solve this as a range search task, where all vectors within a certain radius from the query are returned. We show that the value of a range search result can be modeled rigorously based on the query-to-vector distance. This yields a metric for range search, RSM, that is both principled and easy to compute without running an end-to-end evaluation. We apply this metric to the case of image retrieval. We show that indexing methods that are adapted for top-k retrieval do not necessarily maximize the RSM. In particular, for inverted file based indexes, we show that visiting a limited set of clusters and encoding vectors compactly yields near optimal results.
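The difference between top-k retrieval and range search can be illustrated with a brute-force sketch in Python. The helper names and toy data are hypothetical, and nothing here implements an index structure or the RSM metric; it only shows that range search can return an empty list when nothing matches, while top-k always returns k results.

```python
import math

def l2(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, database, k):
    """Indices of the k nearest database vectors, nearest first."""
    order = sorted(range(len(database)), key=lambda i: l2(query, database[i]))
    return order[:k]

def range_search(query, database, radius):
    """Indices of all database vectors within `radius` of the query, nearest first."""
    hits = sorted((l2(query, v), i) for i, v in enumerate(database))
    return [i for d, i in hits if d <= radius]

database = [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0), (10.0, 10.0)]

near_query = (0.0, 0.1)    # close to the first two vectors
far_query = (50.0, 50.0)   # close to nothing in the database

near_topk = top_k(near_query, database, 3)
near_range = range_search(near_query, database, 0.5)
far_topk = top_k(far_query, database, 3)
far_range = range_search(far_query, database, 0.5)
print(near_topk, near_range, far_topk, far_range)
```

For the far query, top-k still returns three indices even though none is a plausible match, whereas range search returns nothing, which is the behavior a hard match/no-match decision needs.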

The Faiss library

Jan 16, 2024

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, Hervé Jégou

Abstract:Vector databases manage large collections of embedding vectors. As AI applications are growing rapidly, so is the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This paper first describes the tradeoff space of vector search, then the design principles of Faiss in terms of structure, approach to optimization and interfacing. We benchmark key features of the library and discuss a few selected applications to highlight its broad applicability.

Improving Wikipedia Verifiability with AI

Jul 08, 2022

Fabio Petroni, Samuel Broscheit, Aleksandra Piktus, Patrick Lewis, Gautier Izacard, Lucas Hosseini, Jane Dwivedi-Yu, Maria Lomeli, Timo Schick, Pierre-Emmanuel Mazaré(+3 more)

Abstract:Verifiability is a core content policy of Wikipedia: claims that are likely to be challenged need to be backed by citations. There are millions of articles available online and thousands of new articles are released each month. For this reason, finding relevant sources is a difficult task: many claims do not have any references that support them. Furthermore, even existing citations might not support a given claim or become obsolete once the original source is updated or deleted. Hence, maintaining and improving the quality of Wikipedia references is an important challenge and there is a pressing need for better tools to assist humans in this effort. Here, we show that the process of improving references can be tackled with the help of artificial intelligence (AI). We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims, and subsequently recommend better ones from the web. We train this model on existing Wikipedia references, therefore learning from the contributions and combined wisdom of thousands of Wikipedia editors. Using crowd-sourcing, we observe that for the top 10% most likely citations to be tagged as unverifiable by our system, humans prefer our system's suggested alternatives compared to the originally cited reference 70% of the time. To validate the applicability of our system, we built a demo to engage with the English-speaking Wikipedia community and find that Side's first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims according to Side. Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia. More generally, we hope that our work can be used to assist fact checking efforts and increase the general trustworthiness of information online.

Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

Jul 02, 2020

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

Abstract:Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally more efficient and yields better performance than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by 12-15% relative.

Unsupervised pretraining transfers well across languages

Feb 07, 2020

Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, Emmanuel Dupoux

Abstract:Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervised pretraining transfers well across languages. We show that a slight modification of the CPC pretraining extracts features that transfer well to other languages, performing on par with, or even outperforming, supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.

* ICASSP 2020
* 6 pages. Accepted at ICASSP 2020. However, the 2 pages of supplementary material will appear only in the arXiv version.

Libri-Light: A Benchmark for ASR with Limited or No Supervision

Dec 17, 2019

Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen(+5 more)

Abstract:We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

Reference-less Quality Estimation of Text Simplification Systems

Jan 30, 2019

Louis Martin, Samuel Humeau, Pierre-Emmanuel Mazaré, Antoine Bordes, Éric Villemonte de La Clergerie, Benoît Sagot

Abstract:The evaluation of text simplification (TS) systems remains an open challenge. As the task has common points with machine translation (MT), TS is often evaluated using MT metrics such as BLEU. However, such metrics require high quality reference data, which is rarely available for TS. TS has the advantage over MT of being a monolingual task, which allows for direct comparisons to be made between the simplified text and its original version. In this paper, we compare multiple approaches to reference-less quality estimation of sentence-level text simplification systems, based on the dataset used for the QATS 2016 shared task. We distinguish three different dimensions: grammaticality, meaning preservation and simplicity. We show that n-gram-based MT metrics such as BLEU and METEOR correlate the most with human judgment of grammaticality and meaning preservation, whereas simplicity is best evaluated by basic length-based metrics.

* 1st Workshop on Automatic Text Adaptation (ATA), Nov 2018, Tilburg, Netherlands. https://www.ida.liu.se/~evere22/ATA-18/

Training Millions of Personalized Dialogue Agents

Sep 06, 2018

Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, Antoine Bordes

Abstract:Current dialogue systems are not very engaging for users, especially when trained end-to-end without relying on proactive reengaging scripted strategies. Zhang et al. (2018) showed that the engagement level of end-to-end dialogue models increases when conditioning them on text personas providing some personalized back-story to the model. However, the dataset used in Zhang et al. (2018) is synthetic and of limited size as it contains around 1k different personas. In this paper we introduce a new dataset providing 5 million personas and 700 million persona-based dialogues. Our experiments show that, at this scale, training using personas still improves the performance of end-to-end systems. In addition, we show that other tasks benefit from the wide coverage of our dataset by fine-tuning our model on the data from Zhang et al. (2018) and achieving state-of-the-art results.

* EMNLP 2018

Weaver: Deep Co-Encoding of Questions and Documents for Machine Reading

Apr 27, 2018

Martin Raison, Pierre-Emmanuel Mazaré, Rajarshi Das, Antoine Bordes

Abstract:This paper aims at improving how machines can answer questions directly from text, with the focus of having models that can answer correctly multiple types of questions and from various types of texts, documents or even from large collections of them. To that end, we introduce the Weaver model that uses a new way to relate a question to a textual context by weaving layers of recurrent networks, with the goal of making as few assumptions as possible as to how the information from both question and context should be combined to form the answer. We show empirically on six datasets that Weaver performs well in multiple conditions. For instance, it produces solid results on the very popular SQuAD dataset (Rajpurkar et al., 2016), solves almost all bAbI tasks (Weston et al., 2015) and greatly outperforms state-of-the-art methods for open domain question answering from text (Chen et al., 2017).
