Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Roman Vashurin

Mohamed bin Zayed University of Artificial Intelligence

CoCoA: A Generalized Approach to Uncertainty Quantification by Integrating Confidence and Consistency of LLM Outputs

Feb 07, 2025

Roman Vashurin, Maiya Goloburda, Preslav Nakov, Artem Shelmanov, Maxim Panov

Abstract:Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompasses a variety of approaches, with two major types being particularly prominent: information-based, which focus on model confidence expressed as token probabilities, and consistency-based, which assess the semantic relationship between multiple outputs generated using repeated sampling. Several recent methods have combined these two approaches and shown impressive performance in various applications. However, they sometimes fail to outperform much simpler baseline methods. Our investigation reveals distinctive characteristics of LLMs as probabilistic models, which help to explain why these UQ methods underperform in certain tasks. Based on these findings, we propose a new way of synthesizing model confidence and output consistency that leads to a family of efficient and robust UQ methods. We evaluate our approach across a variety of tasks such as question answering, abstractive summarization, and machine translation, demonstrating sizable improvements over state-of-the-art UQ approaches.

Via

Access Paper or Ask Questions

Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Jun 21, 2024

Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Akim Tsvigun, Daniil Vasilev, Rui Xing, Abdelrahman Boda Sadallah, Lyudmila Rvanova, Sergey Petrakov, Alexander Panchenko(+4 more)

Figure 1 for Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Figure 2 for Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Figure 3 for Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Figure 4 for Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Abstract:Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML). The rapid proliferation of large language models (LLMs) has stimulated researchers to seek efficient and effective approaches to UQ in text generation tasks, as in addition to their emerging capabilities, these models have introduced new challenges for building safe applications. As with other ML models, LLMs are prone to make incorrect predictions, ``hallucinate'' by fabricating claims, or simply generate low-quality output for a given input. UQ is a key element in dealing with these challenges. However research to date on UQ methods for LLMs has been fragmented, with disparate evaluation methods. In this work, we tackle this issue by introducing a novel benchmark that implements a collection of state-of-the-art UQ baselines, and provides an environment for controllable and consistent evaluation of novel techniques by researchers in various text generation tasks. Our benchmark also supports the assessment of confidence normalization methods in terms of their ability to provide interpretable scores. Using our benchmark, we conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks and shed light on the most promising approaches.

* Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev contributed equally

Via

Access Paper or Ask Questions

LM-Polygraph: Uncertainty Estimation for Language Models

Nov 13, 2023

Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov(+2 more)

Figure 1 for LM-Polygraph: Uncertainty Estimation for Language Models

Figure 2 for LM-Polygraph: Uncertainty Estimation for Language Models

Figure 3 for LM-Polygraph: Uncertainty Estimation for Language Models

Figure 4 for LM-Polygraph: Uncertainty Estimation for Language Models

Abstract:Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often "hallucinate", i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.

* Accepted at EMNLP-2023

Via

Access Paper or Ask Questions

Embedded Ensembles: Infinite Width Limit and Operating Regimes

Feb 24, 2022

Maksim Velikanov, Roman Kail, Ivan Anokhin, Roman Vashurin, Maxim Panov, Alexey Zaytsev, Dmitry Yarotsky

Figure 1 for Embedded Ensembles: Infinite Width Limit and Operating Regimes

Figure 2 for Embedded Ensembles: Infinite Width Limit and Operating Regimes

Figure 3 for Embedded Ensembles: Infinite Width Limit and Operating Regimes

Figure 4 for Embedded Ensembles: Infinite Width Limit and Operating Regimes

Abstract:A memory efficient approach to ensembling neural networks is to share most weights among the ensembled models by means of a single reference network. We refer to this strategy as Embedded Ensembling (EE); its particular examples are BatchEnsembles and Monte-Carlo dropout ensembles. In this paper we perform a systematic theoretical and empirical analysis of embedded ensembles with different number of models. Theoretically, we use a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics. In this limit, we identify two ensemble regimes - independent and collective - depending on the architecture and initialization strategy of ensemble models. We prove that in the independent regime the embedded ensemble behaves as an ensemble of independent models. We confirm our theoretical prediction with a wide range of experiments with finite networks, and further study empirically various effects such as transition between the two regimes, scaling of ensemble performance with the network width and number of models, and dependence of performance on a number of architecture and hyperparameter choices.

Via

Access Paper or Ask Questions