Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Martin Franz

Granite Embedding Models

Feb 27, 2025

Parul Awasthy, Aashka Trivedi, Yulong Li, Mihaela Bornea, David Cox, Abraham Daniels, Martin Franz, Gabe Goodhart, Bhavani Iyer, Vishwajeet Kumar(+12 more)

Abstract:We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse retrieval architectures, with both English and Multilingual capabilities. This report provides the technical details of training these highly effective 12 layer embedding models, along with their efficient 6 layer distilled counterparts. Extensive evaluations show that the models, developed with techniques like retrieval oriented pretraining, contrastive finetuning, knowledge distillation, and model merging significantly outperform publicly available models of similar sizes on both internal IBM retrieval and search tasks, and have equivalent performance on widely used information retrieval benchmarks, while being trained on high-quality data suitable for enterprise use. We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use at https://huggingface.co/collections/ibm-granite.

Via

Access Paper or Ask Questions

UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers

Mar 01, 2023

Jon Saad-Falcon, Omar Khattab, Keshav Santhanam, Radu Florian, Martin Franz, Salim Roukos, Avirup Sil, Md Arafat Sultan, Christopher Potts

Figure 1 for UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers

Figure 2 for UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers

Figure 3 for UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers

Figure 4 for UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers

Abstract:Many information retrieval tasks require large labeled datasets for fine-tuning. However, such datasets are often unavailable, and their utility for real-world applications can diminish quickly due to domain shifts. To address this challenge, we develop and motivate a method for using large language models (LLMs) to generate large numbers of synthetic queries cheaply. The method begins by generating a small number of synthetic queries using an expensive LLM. After that, a much less expensive one is used to create large numbers of synthetic queries, which are used to fine-tune a family of reranker models. These rerankers are then distilled into a single efficient retriever for use in the target domain. We show that this technique boosts zero-shot accuracy in long-tail domains, even where only 2K synthetic queries are used for fine-tuning, and that it achieves substantially lower latency than standard reranking methods. We make our end-to-end approach, including our synthetic datasets and replication code, publicly available on Github.

Via

Access Paper or Ask Questions

PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

Jan 25, 2023

Avirup Sil, Jaydeep Sen, Bhavani Iyer, Martin Franz, Kshitij Fadnis, Mihaela Bornea, Sara Rosenthal, Scott McCarley, Rong Zhang, Vishwajeet Kumar(+5 more)

Figure 1 for PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

Figure 2 for PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

Figure 3 for PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

Figure 4 for PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

Abstract:The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA re-search and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation.It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on pub-lic benchmarks, and expanding pre-existing methods. PRIMEQA is available at : https://github.com/primeqa.

Via

Access Paper or Ask Questions

Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking

Dec 02, 2022

Keshav Santhanam, Jon Saad-Falcon, Martin Franz, Omar Khattab, Avirup Sil, Radu Florian, Md Arafat Sultan, Salim Roukos, Matei Zaharia, Christopher Potts

Abstract:Neural information retrieval (IR) systems have progressed rapidly in recent years, in large part due to the release of publicly available benchmarking tasks. Unfortunately, some dimensions of this progress are illusory: the majority of the popular IR benchmarks today focus exclusively on downstream task accuracy and thus conceal the costs incurred by systems that trade away efficiency for quality. Latency, hardware cost, and other efficiency considerations are paramount to the deployment of IR systems in user-facing settings. We propose that IR benchmarks structure their evaluation methodology to include not only metrics of accuracy, but also efficiency considerations such as a query latency and the corresponding cost budget for a reproducible hardware setting. For the popular IR benchmarks MS MARCO and XOR-TyDi, we show how the best choice of IR system varies according to how these efficiency considerations are chosen and weighed. We hope that future benchmarks will adopt these guidelines toward more holistic IR evaluation.

Via

Access Paper or Ask Questions

Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval

Apr 24, 2022

Revanth Gangi Reddy, Md Arafat Sultan, Martin Franz, Avirup Sil, Heng Ji

Figure 1 for Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval

Figure 2 for Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval

Figure 3 for Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval

Figure 4 for Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval

Abstract:We show that supervised neural information retrieval (IR) models are prone to learning sparse attention patterns over passage tokens, which can result in key phrases including named entities receiving low attention weights, eventually leading to model under-performance. Using a novel targeted synthetic data generation method that identifies poorly attended entities and conditions the generation episodes on those, we teach neural IR to attend more uniformly and robustly to all entities in a given passage. On two public IR benchmarks, we empirically show that the proposed method helps improve both the model's attention patterns and retrieval performance, including in zero-shot settings.

* Published at SIGIR 2022

Via

Access Paper or Ask Questions

Learning Cross-Lingual IR from an English Retriever

Dec 15, 2021

Yulong Li, Martin Franz, Md Arafat Sultan, Bhavani Iyer, Young-Suk Lee, Avirup Sil

Figure 1 for Learning Cross-Lingual IR from an English Retriever

Figure 2 for Learning Cross-Lingual IR from an English Retriever

Figure 3 for Learning Cross-Lingual IR from an English Retriever

Figure 4 for Learning Cross-Lingual IR from an English Retriever

Abstract:We present a new cross-lingual information retrieval (CLIR) model trained using multi-stage knowledge distillation (KD). The teacher and the student are heterogeneous systems-the former is a pipeline that relies on machine translation and monolingual IR, while the latter executes a single CLIR operation. We show that the student can learn both multilingual representations and CLIR by optimizing two corresponding KD objectives. Learning multilingual representations from an English-only retriever is accomplished using a novel cross-lingual alignment algorithm that greedily re-positions the teacher tokens for alignment. Evaluation on the XOR-TyDi benchmark shows that the proposed model is far more effective than the existing approach of fine-tuning with cross-lingual labeled IR data, with a gain in accuracy of 25.4 Recall@5kt.

* 6 pages

Via

Access Paper or Ask Questions

Towards Robust Neural Retrieval Models with Synthetic Pre-Training

Apr 15, 2021

Revanth Gangi Reddy, Vikas Yadav, Md Arafat Sultan, Martin Franz, Vittorio Castelli, Heng Ji, Avirup Sil

Figure 1 for Towards Robust Neural Retrieval Models with Synthetic Pre-Training

Figure 2 for Towards Robust Neural Retrieval Models with Synthetic Pre-Training

Figure 3 for Towards Robust Neural Retrieval Models with Synthetic Pre-Training

Figure 4 for Towards Robust Neural Retrieval Models with Synthetic Pre-Training

Abstract:Recent work has shown that commonly available machine reading comprehension (MRC) datasets can be used to train high-performance neural information retrieval (IR) systems. However, the evaluation of neural IR has so far been limited to standard supervised learning settings, where they have outperformed traditional term matching baselines. We conduct in-domain and out-of-domain evaluations of neural IR, and seek to improve its robustness across different scenarios, including zero-shot settings. We show that synthetic training examples generated using a sequence-to-sequence generator can be effective towards this goal: in our experiments, pre-training with synthetic examples improves retrieval performance in both in-domain and out-of-domain evaluation on five different test sets.

Via

Access Paper or Ask Questions

The TechQA Dataset

Nov 08, 2019

Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Radu Florian, Martin Franz, Dinesh Garg, Dinesh Khandelwal, Scott McCarley, Mike McCawley(+11 more)

Abstract:We introduce TechQA, a domain-adaptation question answering dataset for the technical support domain. The TechQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size -- 600 training, 310 dev, and 490 evaluation question/answer pairs -- thus reflecting the cost of creating large labeled datasets with actual data. Consequently, TechQA is meant to stimulate research in domain adaptation rather than being a resource to build QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote---a technical document that addresses a specific technical issue. We also release a collection of the 801,998 publicly available Technotes as of April 4, 2019 as a companion resource that might be used for pretraining, to learn representations of the IT domain language.

* Long version of conference paper to be submitted

Via

Access Paper or Ask Questions