Title: Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models

Oct 10, 2024

Qingni Wang, Tiantian Geng, Zhiyuan Wang, Teng Wang, Bo Fu, Feng Zheng

Abstract: Multimodal Large Language Models (MLLMs) exhibit promising advancements across various tasks, yet they still encounter significant trustworthiness issues. Prior studies apply Split Conformal Prediction (SCP) in language modeling to construct prediction sets with statistical guarantees. However, these methods typically rely on internal model logits or are restricted to multiple-choice settings, which hampers their generalizability and adaptability in dynamic, open-ended environments. In this paper, we introduce TRON, a two-step framework for risk control and assessment, applicable to any MLLM that supports sampling in both open-ended and closed-ended scenarios. TRON comprises two main components: (1) a novel conformal score to sample response sets of minimum size, and (2) a nonconformity score to identify high-quality responses based on self-consistency theory, controlling the error rates by two specific risk levels. Furthermore, we investigate semantic redundancy in prediction sets within open-ended contexts for the first time, leading to a promising evaluation metric for MLLMs based on average set size. Our comprehensive experiments across four Video Question-Answering (VideoQA) datasets utilizing eight MLLMs show that TRON achieves desired error rates bounded by two user-specified risk levels. Additionally, deduplicated prediction sets maintain adaptiveness while being more efficient and stable for risk assessment under different risk levels.
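To make the split conformal idea in the abstract concrete, here is a minimal sketch of generic split conformal prediction over sampled responses. It is not TRON's actual scoring: the function names (`conformal_threshold`, `frequency_scores`, `prediction_set`) are hypothetical, the nonconformity score is assumed to be one minus a response's relative frequency among the samples (a simple self-consistency proxy), and exact string matching stands in for the semantic deduplication the abstract mentions.

```python
import math
from collections import Counter


def conformal_threshold(cal_scores, alpha):
    """Finite-sample split conformal quantile of calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or
    infinity when n is too small to support the risk level alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]


def frequency_scores(samples):
    """Assumed nonconformity score: 1 - relative frequency of each response.

    Counter also collapses exact duplicates, a crude stand-in for
    semantic deduplication.
    """
    counts = Counter(samples)
    total = len(samples)
    return {resp: 1.0 - c / total for resp, c in counts.items()}


def prediction_set(samples, threshold):
    """Keep every distinct response whose score is within the threshold."""
    scores = frequency_scores(samples)
    return sorted(r for r, s in scores.items() if s <= threshold)


if __name__ == "__main__":
    # Hypothetical calibration scores (score of the correct answer per item).
    cal = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.6]
    q = conformal_threshold(cal, alpha=0.25)
    # Hypothetical responses sampled from a model for one test question.
    samples = ["running", "running", "running", "jumping", "walking"]
    print(q, prediction_set(samples, q))
```

In this sketch, a smaller `alpha` raises the threshold and so admits more responses into the set, which is why average set size can serve as a rough indicator of model uncertainty.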

* 15 pages, 6 figures
