Sara Rosenthal

Granite Embedding Models

Feb 27, 2025

Parul Awasthy, Aashka Trivedi, Yulong Li, Mihaela Bornea, David Cox, Abraham Daniels, Martin Franz, Gabe Goodhart, Bhavani Iyer, Vishwajeet Kumar(+12 more)

Abstract: We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse-retrieval architectures, with both English and multilingual capabilities. This report provides the technical details of training these highly effective 12-layer embedding models, along with their efficient 6-layer distilled counterparts. Extensive evaluations show that the models, developed with techniques like retrieval-oriented pretraining, contrastive finetuning, knowledge distillation, and model merging, significantly outperform publicly available models of similar sizes on both internal IBM retrieval and search tasks, and have equivalent performance on widely used information retrieval benchmarks, while being trained on high-quality data suitable for enterprise use. We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use, at https://huggingface.co/collections/ibm-granite.

MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

Jan 07, 2025

Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, Marina Danilevsky

Abstract: Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.

CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems

Apr 02, 2024

Sara Rosenthal, Avirup Sil, Radu Florian, Salim Roukos

Abstract: Retrieval Augmented Generation (RAG) has become a popular application for large language models. It is preferable that successful RAG systems provide accurate answers that are supported by being grounded in a passage without any hallucinations. While considerable work is required for building a full RAG pipeline, being able to benchmark performance is also necessary. We present ClapNQ, a benchmark Long-form Question Answering dataset for the full RAG pipeline. ClapNQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The ClapNQ answers are concise, 3x smaller than the full passage, and cohesive, with multiple pieces of the passage that are not contiguous. RAG models must adapt to these properties to be successful at ClapNQ. We present baseline experiments and analysis for ClapNQ that highlight areas where there is still significant room for improvement in grounded RAG. CLAPNQ is publicly available at https://github.com/primeqa/clapnq

* 25 pages

Muted: Multilingual Targeted Offensive Speech Identification and Visualization

Dec 18, 2023

Christoph Tillmann, Aashka Trivedi, Sara Rosenthal, Santosh Borse, Rong Zhang, Avirup Sil, Bishwaranjan Bhattacharjee

Abstract: Offensive language such as hate, abuse, and profanity (HAP) occurs in various content on the web. While previous work has mostly dealt with sentence-level annotations, there have been a few recent attempts to identify offensive spans as well. We build upon this work and introduce Muted, a system to identify multilingual HAP content by displaying offensive arguments and their targets using heat maps to indicate their intensity. Muted can leverage any transformer-based HAP-classification model and its attention mechanism out-of-the-box to identify toxic spans, without further fine-tuning. In addition, we use the spaCy library to identify the specific targets and arguments for the words predicted by the attention heatmaps. We present the model's performance on identifying offensive spans and their targets in existing datasets and present new annotations on German text. Finally, we demonstrate our proposed visualization tool on multilingual inputs.

* EMNLP 2023 Demo Track

PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

Jan 25, 2023

Avirup Sil, Jaydeep Sen, Bhavani Iyer, Martin Franz, Kshitij Fadnis, Mihaela Bornea, Sara Rosenthal, Scott McCarley, Rong Zhang, Vishwajeet Kumar(+5 more)

Abstract: The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PRIMEQA is available at https://github.com/primeqa.

GAAMA 2.0: An Integrated System that Answers Boolean and Extractive Questions

Jun 21, 2022

Scott McCarley, Mihaela Bornea, Sara Rosenthal, Anthony Ferritto, Md Arafat Sultan, Avirup Sil, Radu Florian

Abstract: Recent machine reading comprehension datasets include extractive and boolean questions, but current approaches do not offer integrated support for answering both question types. We present a multilingual machine reading comprehension system and front-end demo that handles boolean questions by providing both a YES/NO answer and highlighting supporting evidence, and handles extractive questions by highlighting the answer in the passage. Our system, GAAMA 2.0, is ranked first on the TyDi QA leaderboard at the time of this writing. We contrast two different implementations of our approach. The first includes several independent stacks of transformers allowing easy deployment of each component. The second is a single stack of transformers utilizing adapters to reduce GPU memory footprint in a resource-constrained environment.

Task Transfer and Domain Adaptation for Zero-Shot Question Answering

Jun 14, 2022

Xiang Pan, Alex Sheng, David Shimshoni, Aditya Singhal, Sara Rosenthal, Avirup Sil

Abstract: Pretrained language models have shown success in various areas of natural language processing, including reading comprehension tasks. However, when applying machine learning methods to new domains, labeled data may not always be available. To address this, we use supervised pretraining on source-domain data to reduce sample complexity on domain-specific downstream tasks. We evaluate zero-shot performance on domain-specific reading comprehension tasks by combining task transfer with domain adaptation to fine-tune a pretrained model with no labeled data from the target task. Our approach outperforms Domain-Adaptive Pretraining on downstream domain-specific reading comprehension tasks in 3 out of 4 domains.

* NAACL 2022 Deep Learning for Low-Resource NLP Workshop Paper

Do Answers to Boolean Questions Need Explanations? Yes

Dec 14, 2021

Sara Rosenthal, Mihaela Bornea, Avirup Sil, Radu Florian, Scott McCarley

Abstract: Existing datasets that contain boolean questions, such as BoolQ and TyDi QA, provide the user with a YES/NO response to the question. However, a one-word response is not sufficient for an explainable system. We promote explainability by releasing a new set of annotations marking the evidence in existing TyDi QA and BoolQ datasets. We show that our annotations can be used to train a model that extracts improved evidence spans compared to models that rely on existing resources. We confirm our findings with a user study which shows that our extracted evidence spans enhance the user experience. We also provide further insight into the challenges of answering boolean questions, such as passages containing conflicting YES and NO answers, and varying degrees of relevance of the predicted evidence.

* 9 pages

Are Multilingual BERT models robust? A Case Study on Adversarial Attacks for Multilingual Question Answering

Apr 15, 2021

Sara Rosenthal, Mihaela Bornea, Avirup Sil

Abstract: Recent approaches have exploited weaknesses in monolingual question answering (QA) models by adding adversarial statements to the passage. These attacks caused a reduction in state-of-the-art performance by almost 50%. In this paper, we are the first to explore and successfully attack a multilingual QA (MLQA) system pre-trained on multilingual BERT using several attack strategies for the adversarial statement, reducing performance by as much as 85%. We show that the model gives priority to English and the language of the question regardless of the other languages in the QA pair. Further, we also show that adding our attack strategies during training helps alleviate the attacks.

Multilingual Transfer Learning for QA Using Translation as Data Augmentation

Dec 10, 2020

Mihaela Bornea, Lin Pan, Sara Rosenthal, Radu Florian, Avirup Sil

Abstract: Prior work on multilingual question answering has mostly focused on using large multilingual pre-trained language models (LM) to perform zero-shot language-wise learning: train a QA model on English and test on other languages. In this work, we explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space. Our first strategy augments the original English training data with machine translation-generated data. This results in a corpus of multilingual silver-labeled QA pairs that is 14 times larger than the original training set. In addition, we propose two novel strategies, language adversarial training and a language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance and result in LM embeddings that are less language-variant. Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.

* AAAI 2021
