Cicero Nogueira dos Santos

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Mar 08, 2024

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser (+659 more)

Abstract: In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Memory Augmented Language Models through Mixture of Word Experts

Nov 15, 2023

Cicero Nogueira dos Santos, James Lee-Thorp, Isaac Noble, Chung-Ching Chang, David Uthus

Abstract: Scaling up the number of parameters of language models has proven to be an effective approach to improve performance. For dense models, increasing model size proportionally increases the model's computation footprint. In this work, we seek to aggressively decouple learning capacity and FLOPs through Mixture-of-Experts (MoE) style models with large knowledge-rich vocabulary based routing functions and experts. Our proposed approach, dubbed Mixture of Word Experts (MoWE), can be seen as a memory augmented model, where a large set of word-specific experts play the role of a sparse memory. We demonstrate that MoWE performs significantly better than the T5 family of models with a similar number of FLOPs in a variety of NLP tasks. Additionally, MoWE outperforms regular MoE models on knowledge intensive tasks and has similar performance to more complex memory augmented approaches that often require invoking custom mechanisms to search the sparse memory.

* 14 pages
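
The idea of word-specific experts acting as a sparse memory can be illustrated with a toy sketch. This is not the paper's implementation: the routing table, the expert functions, and all names and sizes below are hypothetical, chosen only to show that each token touches a single expert selected by word identity, so per-token compute stays fixed as the number of experts grows.

```python
import math

# Hypothetical toy setup: none of these names or sizes come from the paper.
VOCAB_TO_EXPERT = {"paris": 0, "france": 0, "tokyo": 1, "japan": 1}
DEFAULT_EXPERT = 2  # shared expert for words outside the routing vocabulary
NUM_EXPERTS = 3
DIM = 4


def make_expert(seed):
    """Build a tiny deterministic 'expert': an elementwise scale, shift, tanh."""
    scale = 1.0 + 0.1 * seed
    shift = 0.01 * seed
    return lambda vec: [math.tanh(scale * x + shift) for x in vec]


EXPERTS = [make_expert(i) for i in range(NUM_EXPERTS)]


def route(token):
    """Fixed, vocabulary-based routing: a table lookup on the word itself."""
    return VOCAB_TO_EXPERT.get(token.lower(), DEFAULT_EXPERT)


def mowe_layer(tokens, hidden_states):
    """Apply exactly one expert per token, chosen by word identity."""
    outputs = []
    for token, vec in zip(tokens, hidden_states):
        expert = EXPERTS[route(token)]
        outputs.append(expert(vec))
    return outputs


tokens = ["Paris", "is", "in", "France"]
hidden = [[0.1 * (i + 1)] * DIM for i in range(len(tokens))]
routes = [route(t) for t in tokens]
out = mowe_layer(tokens, hidden)
```

Because routing here is a lookup rather than a learned gate, adding more experts enlarges the "memory" without changing the work done per token.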

Triggering Multi-Hop Reasoning for Question Answering in Language Models using Soft Prompts and Random Walks

Jun 06, 2023

Kanishka Misra, Cicero Nogueira dos Santos, Siamak Shakeri

Abstract: Despite readily memorizing world knowledge about entities, pre-trained language models (LMs) struggle to compose together two or more facts to perform multi-hop reasoning in question-answering tasks. In this work, we propose techniques that improve upon this limitation by relying on random walks over structured knowledge graphs. Specifically, we use soft prompts to guide LMs to chain together their encoded knowledge by learning to map multi-hop questions to random walk paths that lead to the answer. Applying our methods on two T5 LMs shows substantial improvements over standard tuning approaches in answering questions that require 2-hop reasoning.

* Findings of ACL 2023
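
To make "random walk paths that lead to the answer" concrete, here is a minimal sketch of sampling a 2-hop path from a toy knowledge graph. The graph contents, the `random_walk` helper, and the `" ; "` serialization are illustrative assumptions, not the paper's data or format; the soft-prompt training itself is not shown.

```python
import random

# Toy knowledge graph: head entity -> list of (relation, tail entity).
# The entities and relations are illustrative, not taken from the paper.
KG = {
    "Marie Curie": [("born_in", "Warsaw"), ("field", "Physics")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("continent", "Europe")],
    "Physics": [],
    "Europe": [],
}


def random_walk(graph, start, hops, rng):
    """Sample a path of up to `hops` edges starting at `start`."""
    path = [start]
    node = start
    for _ in range(hops):
        edges = graph.get(node, [])
        if not edges:
            break  # dead end: stop early
        relation, node = rng.choice(edges)
        path.extend([relation, node])
    return path


rng = random.Random(0)
walk = random_walk(KG, "Warsaw", hops=2, rng=rng)

# A hypothetical text target that a model could be trained to generate
# for a 2-hop question about Warsaw.
target = " ; ".join(walk)
```

In this toy graph each node on the walk from "Warsaw" has a single outgoing edge, so the sampled path is the same for any seed; with branching nodes such as "Marie Curie" the walk varies between runs.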

Knowledge Prompts: Injecting World Knowledge into Language Models through Soft Prompts

Oct 10, 2022

Cicero Nogueira dos Santos, Zhe Dong, Daniel Cer, John Nham, Siamak Shakeri, Jianmo Ni, Yun-hsuan Sung

Abstract: Soft prompts have been recently proposed as a tool for adapting large frozen language models (LMs) to new tasks. In this work, we repurpose soft prompts to the task of injecting world knowledge into LMs. We introduce a method to train soft prompts via self-supervised learning on data from knowledge bases. The resulting soft knowledge prompts (KPs) are task independent and work as an external memory of the LMs. We perform qualitative and quantitative experiments and demonstrate that: (1) KPs can effectively model the structure of the training data; (2) KPs can be used to improve the performance of LMs in different knowledge intensive tasks.

Counterfactual Data Augmentation improves Factuality of Abstractive Summarization

May 25, 2022

Dheeraj Rajagopal, Siamak Shakeri, Cicero Nogueira dos Santos, Eduard Hovy, Chung-Ching Chang

Abstract: Abstractive summarization systems based on pretrained language models often generate coherent but factually inconsistent sentences. In this paper, we present a counterfactual data augmentation approach where we augment data with perturbed summaries that increase the training data diversity. Specifically, we present three augmentation approaches based on replacing (i) entities from other and the same category and (ii) nouns with their corresponding WordNet hypernyms. We show that augmenting the training data with our approach improves the factual correctness of summaries without significantly affecting the ROUGE score. We show that in two commonly used summarization datasets (CNN/Dailymail and XSum), we improve the factual correctness by about 2.5 points on average.
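
The perturbations named in the abstract (entity replacement within or across categories, and noun-to-hypernym replacement) can be sketched on a toy example. Everything below is an assumption for illustration: the small `ENTITIES` and `HYPERNYMS` dictionaries stand in for a real entity tagger and WordNet, and how the perturbed summaries are used in training is not shown.

```python
import random

# Toy lexicons standing in for an entity tagger and WordNet; all entries
# are illustrative assumptions.
ENTITIES = {
    "PERSON": ["Alice Smith", "Bob Jones"],
    "CITY": ["Paris", "Berlin"],
}
HYPERNYMS = {"dog": "animal", "car": "vehicle"}


def swap_entity(summary, rng, same_category=True):
    """Replace the first lexicon entity found in the summary with another
    entity drawn from the same category or from a different category."""
    for category, names in ENTITIES.items():
        for name in names:
            if name in summary:
                if same_category:
                    pool = [n for n in names if n != name]
                else:
                    pool = [n for c, ns in ENTITIES.items()
                            if c != category for n in ns]
                return summary.replace(name, rng.choice(pool))
    return summary  # no known entity: leave the summary unchanged


def replace_with_hypernyms(summary):
    """Replace each known noun with its (toy) hypernym."""
    return " ".join(HYPERNYMS.get(word, word) for word in summary.split())


rng = random.Random(0)
summary = "Alice Smith drove a car to Paris"
same_cat = swap_entity(summary, rng, same_category=True)
other_cat = swap_entity(summary, rng, same_category=False)
hypernym = replace_with_hypernyms(summary)
```

Here `same_cat` swaps one person for another, `other_cat` puts a city name where the person was, and `hypernym` generalizes "car" to "vehicle".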

ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference

Apr 25, 2022

Kai Hui, Honglei Zhuang, Tao Chen, Zhen Qin, Jing Lu, Dara Bahri, Ji Ma, Jai Prakash Gupta, Cicero Nogueira dos Santos, Yi Tay (+1 more)

Abstract: State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally utilize an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms, however, are not without flaws, i.e., running the model on all query-document pairs at inference-time incurs a significant computational cost. This paper proposes a new training and inference paradigm for re-ranking. We propose to finetune a pretrained encoder-decoder model in the form of document-to-query generation. Subsequently, we show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference. This results in significant inference time speedups since the decoder-only architecture only needs to learn to interpret static encoder embeddings during inference. Our experiments show that this new paradigm achieves results that are comparable to the more expensive cross-attention ranking approaches while being up to 6.8X faster. We believe this work paves the way for more efficient neural rankers that leverage large pretrained models.

* Findings of ACL 2022

Contrastive Fine-tuning Improves Robustness for Neural Rankers

May 27, 2021

Xiaofei Ma, Cicero Nogueira dos Santos, Andrew O. Arnold

Abstract: The performance of state-of-the-art neural rankers can deteriorate substantially when exposed to noisy inputs or applied to a new domain. In this paper, we present a novel method for fine-tuning neural rankers that can significantly improve their robustness to out-of-domain data and query perturbations. Specifically, a contrastive loss that compares data points in the representation space is combined with the standard ranking loss during fine-tuning. We use relevance labels to denote similar/dissimilar pairs, which allows the model to learn the underlying matching semantics across different query-document pairs and leads to improved robustness. In experiments with four passage ranking datasets, the proposed contrastive fine-tuning method obtains improvements on robustness to query reformulations, noise perturbations, and zero-shot transfer for both BERT and BART based rankers. Additionally, our experiments show that contrastive fine-tuning outperforms data augmentation for robustifying neural rankers.

* Findings of ACL 2021
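
A minimal sketch of combining a ranking loss with a contrastive loss on representations is given below. The abstract does not specify the exact loss forms, so this uses assumed stand-ins: pointwise binary cross-entropy as the ranking loss, a classic margin-based pairwise contrastive loss, pairs treated as "similar" when their relevance labels match, and a hypothetical mixing `weight`.

```python
import math


def bce(score, label):
    """Binary cross-entropy on a relevance logit (pointwise ranking loss)."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def distance(u, v):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_contrastive(u, v, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs beyond the margin."""
    d = distance(u, v)
    return d ** 2 if similar else max(0.0, margin - d) ** 2


def combined_loss(scores, labels, reps, weight=0.5, margin=1.0):
    """Mean ranking loss plus `weight` times mean pairwise contrastive loss."""
    n = len(scores)
    ranking = sum(bce(s, y) for s, y in zip(scores, labels)) / n
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    contrastive = sum(
        pair_contrastive(reps[i], reps[j], labels[i] == labels[j], margin)
        for i, j in pairs
    ) / max(len(pairs), 1)
    return ranking + weight * contrastive


scores = [2.0, -1.0, 0.5]   # ranker logits for three query-document pairs
labels = [1, 0, 1]          # relevance labels
good_reps = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]  # relevant pairs coincide
bad_reps = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]   # relevant pairs drift apart

loss_good = combined_loss(scores, labels, good_reps)
loss_bad = combined_loss(scores, labels, bad_reps)
```

With identical scores, `loss_bad` exceeds `loss_good` only through the contrastive term, which penalizes the two relevant examples for sitting apart in representation space.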

Joint Text and Label Generation for Spoken Language Understanding

May 11, 2021

Yang Li, Ben Athiwaratkun, Cicero Nogueira dos Santos, Bing Xiang

Abstract: Generalization is a central problem in machine learning, especially when data is limited. Using prior information to enforce constraints is the principled way of encouraging generalization. In this work, we propose to leverage the prior information embedded in pretrained language models (LM) to improve generalization for intent classification and slot labeling tasks with limited training data. Specifically, we extract prior knowledge from pretrained LM in the form of synthetic data, which encode the prior implicitly. We fine-tune the LM to generate an augmented language, which contains not only text but also encodes both intent labels and slot labels. The generated synthetic data can be used to train a classifier later. Since the generated data may contain noise, we rephrase the learning from generated data as learning with noisy labels. We then utilize the mixout regularization for the classifier and prove its effectiveness to resist label noise in generated data. Empirically, our method demonstrates superior performance and outperforms the baseline by a large margin.

Improving Factual Consistency of Abstractive Summarization via Question Answering

May 10, 2021

Feng Nan, Cicero Nogueira dos Santos, Henghui Zhu, Patrick Ng, Kathleen McKeown, Ramesh Nallapati, Dejiao Zhang, Zhiguo Wang, Andrew O. Arnold, Bing Xiang

Abstract: A commonly observed problem with the state-of-the-art abstractive summarization models is that the generated summaries can be factually inconsistent with the input documents. The fact that automatic summarization may produce plausible-sounding yet inaccurate summaries is a major concern that limits its wide application. In this paper we present an approach to address factual consistency in summarization. We first propose an efficient automatic evaluation metric to measure factual consistency; next, we propose a novel learning algorithm that maximizes the proposed metric during model training. Through extensive experiments, we confirm that our method is effective in improving factual consistency and even overall quality of the summaries, as judged by both automatic metrics and human evaluation.

* ACL-IJCNLP 2021

Generative Context Pair Selection for Multi-hop Question Answering

Apr 18, 2021

Dheeru Dua, Cicero Nogueira dos Santos, Patrick Ng, Ben Athiwaratkun, Bing Xiang, Matt Gardner, Sameer Singh

Abstract: Compositional reasoning tasks like multi-hop question answering require making latent decisions to get the final answer, given a question. However, crowdsourced datasets often capture only a slice of the underlying task distribution, which can induce unanticipated biases in models performing compositional reasoning. Furthermore, discriminatively trained models exploit such biases to get a better held-out performance, without learning the right way to reason, as they do not necessitate paying attention to the question representation (conditioning variable) in its entirety, to estimate the answer likelihood. In this work, we propose a generative context selection model for multi-hop question answering that reasons about how the given question could have been generated given a context pair. While being comparable to the state-of-the-art answering performance, our proposed generative passage selection model has a better performance (4.9% higher than baseline) on an adversarial held-out set which tests the robustness of the model's multi-hop reasoning capabilities.
