Eugene Ilyushin

Uncertainty-Aware Evaluation for Vision-Language Models

Feb 24, 2024

Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, Eugene Ilyushin

Abstract: Vision-Language Models (VLMs) such as GPT-4, LLaVA, and CogVLM have recently surged in popularity due to their impressive performance on several vision-language tasks. Current evaluation methods, however, overlook an essential component: uncertainty, which is crucial for a comprehensive assessment of VLMs. To address this oversight, we present a benchmark that incorporates uncertainty quantification into VLM evaluation. Our analysis spans 20+ VLMs, focusing on the multiple-choice Visual Question Answering (VQA) task. We examine models on 5 datasets that evaluate various vision-language capabilities. Using conformal prediction as an uncertainty estimation approach, we demonstrate that the models' uncertainty is not aligned with their accuracy. Specifically, we show that models with the highest accuracy may also have the highest uncertainty, which confirms the importance of measuring it for VLMs. Our empirical findings also reveal a correlation between a model's uncertainty and its language model component.
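
The abstract names conformal prediction as the uncertainty estimator for multiple-choice VQA but does not give the details. As a rough illustration only, the sketch below shows split conformal prediction with one common score, 1 minus the probability assigned to the correct option. The score function, the function names, the `alpha` value, and the toy probabilities are all assumptions made for this example and are not taken from the paper. The size of each returned set serves as the uncertainty signal: larger sets mean the model is less certain.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set (split conformal)."""
    n = len(cal_labels)
    # Nonconformity score (assumed here): 1 - probability of the true option.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_sets(test_probs, qhat):
    """Keep every answer option whose score does not exceed the threshold."""
    return [np.flatnonzero(1.0 - p <= qhat) for p in test_probs]

# Toy calibration data: 4 questions, 4 answer options each (made-up numbers).
cal_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.60, 0.20, 0.10],
])
cal_labels = np.array([0, 0, 2, 1])

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

# Two test questions: one confident prediction, one ambiguous.
test_probs = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.30, 0.30, 0.20, 0.20],
])
sets = prediction_sets(test_probs, qhat)
set_sizes = [len(s) for s in sets]  # per-question uncertainty proxy
```

In this toy run the confident question gets a single-option set and the ambiguous one gets two options, which is how a model can be accurate on average yet still show high uncertainty through large sets.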

DIALOG-22 RuATD Generated Text Detection

Jun 16, 2022

Narek Maloyan, Bulat Nutfullin, Eugene Ilyushin

Abstract: Text Generation Models (TGMs) succeed in creating text that matches human language style reasonably well. Detectors that can distinguish TGM-generated text from human-written text play an important role in preventing abuse of TGMs. In this paper, we describe our pipeline for the two DIALOG-22 RuATD tasks: detecting generated text (binary task) and classifying which model was used to generate the text (multiclass task). We achieved 1st place on the binary classification task with an accuracy score of 0.82995 on the private test set and 4th place on the multiclass classification task with an accuracy score of 0.62856 on the private test set. We proposed an ensemble of different pre-trained models based on the attention mechanism.

* 6 pages
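
The abstract mentions an attention-based ensemble of pre-trained models without describing its architecture. The sketch below is one possible reading, not the authors' method: each model's class probabilities are scored against a query vector, the scores are turned into weights with a softmax, and the final prediction is the weighted average. The function names, the scaled dot-product scoring, and the fixed `query` (which would normally be learned) are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_ensemble(model_probs, query):
    """Combine per-model class probabilities with attention weights.

    model_probs: array of shape (n_models, n_classes)
    query: array of shape (n_classes,), a stand-in for a learned vector
    """
    # Scaled dot-product score for each model.
    scores = model_probs @ query / np.sqrt(model_probs.shape[1])
    weights = softmax(scores)
    # Convex combination of the models' probability vectors.
    return weights @ model_probs, weights

# Made-up outputs of three detectors for one text: [P(human), P(generated)].
model_probs = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.2, 0.8],
])
query = np.array([1.0, -1.0])  # hypothetical fixed query, not a trained one

combined, weights = attention_ensemble(model_probs, query)
predicted_class = int(np.argmax(combined))
```

Because the weights are a softmax output, the combined vector stays a valid probability distribution whenever each model's output is one.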
