
Title: NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation

Dec 18, 2023

Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh (+1 more)


Abstract: Retrieval-augmented generation (RAG) grounds large language model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior work lacks a comprehensive evaluation across different language families, making it challenging to evaluate LLM robustness against errors in externally retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain only passages manually judged as non-relevant or noisy, whereas queries in the relevant subset include at least one passage judged relevant. We measure LLM robustness using two metrics: (i) hallucination rate, measuring the model's tendency to hallucinate an answer when the answer is not present in the passages of the non-relevant subset, and (ii) error rate, measuring the model's failure to recognize relevant passages in the relevant subset. We build a GPT-4 baseline which achieves, on average, a 33.2% hallucination rate on the non-relevant subset and a 14.9% error rate on the relevant subset. Our evaluation reveals that GPT-4 hallucinates frequently in high-resource languages, such as French or English. This work highlights an important avenue for future research: improving LLM robustness by learning to better reject non-relevant information in RAG.
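The two metrics described in the abstract can be illustrated with a minimal Python sketch. It assumes each model response has already been reduced to a boolean meaning "the model claimed the passages contain an answer"; that reduction, and the function names `hallucination_rate` and `error_rate`, are assumptions made for illustration and are not taken from the paper's code.

```python
from typing import Sequence


def _fraction(flags: Sequence[bool]) -> float:
    """Return the share of True values; reject empty input explicitly."""
    if len(flags) == 0:
        raise ValueError("need at least one prediction")
    return sum(flags) / len(flags)


def hallucination_rate(claims_answer: Sequence[bool]) -> float:
    """Non-relevant subset: no passage holds the answer, so every
    'an answer exists' claim (True) counts as a hallucination."""
    return _fraction(list(claims_answer))


def error_rate(claims_answer: Sequence[bool]) -> float:
    """Relevant subset: at least one passage is relevant, so every
    'no answer' response (False) counts as an error."""
    return _fraction([not claimed for claimed in claims_answer])


# Hypothetical predictions, not data from the paper.
non_relevant_preds = [True, False, False, True]   # 2 of 4 hallucinated
relevant_preds = [True, True, True, False]        # 1 of 4 missed

print(hallucination_rate(non_relevant_preds))  # 0.5
print(error_rate(relevant_preds))              # 0.25
```

Under this reading, both metrics are simple proportions computed separately on each subset, which is why a model can score well on one while doing poorly on the other.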
