Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Eric Slyman

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Sep 10, 2025

Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Abstract:Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.

* 17 pages, 8 figures, Accepted at ICCV 2025

Via

Access Paper or Ask Questions

Hijacking Vision-and-Language Navigation Agents with Adversarial Environmental Attacks

Dec 03, 2024

Zijiao Yang, Xiangxi Shi, Eric Slyman, Stefan Lee

Figure 1 for Hijacking Vision-and-Language Navigation Agents with Adversarial Environmental Attacks

Figure 2 for Hijacking Vision-and-Language Navigation Agents with Adversarial Environmental Attacks

Figure 3 for Hijacking Vision-and-Language Navigation Agents with Adversarial Environmental Attacks

Figure 4 for Hijacking Vision-and-Language Navigation Agents with Adversarial Environmental Attacks

Abstract:Assistive embodied agents that can be instructed in natural language to perform tasks in open-world environments have the potential to significantly impact labor tasks like manufacturing or in-home care -- benefiting the lives of those who come to depend on them. In this work, we consider how this benefit might be hijacked by local modifications in the appearance of the agent's operating environment. Specifically, we take the popular Vision-and-Language Navigation (VLN) task as a representative setting and develop a whitebox adversarial attack that optimizes a 3D attack object's appearance to induce desired behaviors in pretrained VLN agents that observe it in the environment. We demonstrate that the proposed attack can cause VLN agents to ignore their instructions and execute alternative actions after encountering the attack object -- even for instructions and agent paths not considered when optimizing the attack. For these novel settings, we find our attacks can induce early-termination behaviors or divert an agent along an attacker-defined multi-step trajectory. Under both conditions, environmental attacks significantly reduce agent capabilities to successfully follow user instructions.

* Accepted by WACV 2025

Via

Access Paper or Ask Questions

You Never Know: Quantization Induces Inconsistent Biases in Vision-Language Foundation Models

Oct 26, 2024

Eric Slyman, Anirudh Kanneganti, Sanghyun Hong, Stefan Lee

Abstract:We study the impact of a standard practice in compressing foundation vision-language models - quantization - on the models' ability to produce socially-fair outputs. In contrast to prior findings with unimodal models that compression consistently amplifies social biases, our extensive evaluation of four quantization settings across three datasets and three CLIP variants yields a surprising result: while individual models demonstrate bias, we find no consistent change in bias magnitude or direction across a population of compressed models due to quantization.

* Workshop paper at NeurIPS 2024 RBFM. 6 pages, 3 figures

Via

Access Paper or Ask Questions

FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Apr 24, 2024

Eric Slyman, Stefan Lee, Scott Cohen, Kushal Kafle

Figure 1 for FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Figure 2 for FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Figure 3 for FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Figure 4 for FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Abstract:Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.

* Conference paper at CVPR 2024. 6 pages, 8 figures. Project Page: https://ericslyman.com/fairdedup/

Via

Access Paper or Ask Questions

On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

Nov 09, 2023

Daniel Claborne, Eric Slyman, Karl Pazdernik

Figure 1 for On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

Figure 2 for On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

Figure 3 for On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

Figure 4 for On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

Abstract:We train an identity verification architecture and evaluate modifications to the part of the model that combines audio and visual representations, including in scenarios where one input is missing in either of two examples to be compared. We report results on the Voxceleb1-E test set that suggest averaging the output embeddings improves error rate in the full-modality setting and when a single modality is missing, and makes more complete use of the embedding space than systems which use shared layers and discuss possible reasons for this behavior.

Via

Access Paper or Ask Questions

VLSlice: Interactive Vision-and-Language Slice Discovery

Sep 13, 2023

Eric Slyman, Minsuk Kahng, Stefan Lee

Figure 1 for VLSlice: Interactive Vision-and-Language Slice Discovery

Figure 2 for VLSlice: Interactive Vision-and-Language Slice Discovery

Figure 3 for VLSlice: Interactive Vision-and-Language Slice Discovery

Figure 4 for VLSlice: Interactive Vision-and-Language Slice Discovery

Abstract:Recent work in vision-and-language demonstrates that large-scale pretraining can learn generalizable models that are efficiently transferable to downstream tasks. While this may improve dataset-scale aggregate metrics, analyzing performance around hand-crafted subgroups targeting specific bias dimensions reveals systemic undesirable behaviors. However, this subgroup analysis is frequently stalled by annotation efforts, which require extensive time and resources to collect the necessary data. Prior art attempts to automatically discover subgroups to circumvent these constraints but typically leverages model behavior on existing task-specific annotations and rapidly degrades on more complex inputs beyond "tabular" data, none of which study vision-and-language models. This paper presents VLSlice, an interactive system enabling user-guided discovery of coherent representation-level subgroups with consistent visiolinguistic behavior, denoted as vision-and-language slices, from unlabeled image sets. We show that VLSlice enables users to quickly generate diverse high-coherency slices in a user study (n=22) and release the tool publicly.

* Conference paper at ICCV 2023. 17 pages, 11 figures. https://ericslyman.com/vlslice/

Via

Access Paper or Ask Questions

Fine-Grained Classroom Activity Detection from Audio with Neural Networks

Jul 29, 2021

Eric Slyman, Chris Daw, Morgan Skrabut, Ana Usenko, Brian Hutchinson

Figure 1 for Fine-Grained Classroom Activity Detection from Audio with Neural Networks

Figure 2 for Fine-Grained Classroom Activity Detection from Audio with Neural Networks

Figure 3 for Fine-Grained Classroom Activity Detection from Audio with Neural Networks

Figure 4 for Fine-Grained Classroom Activity Detection from Audio with Neural Networks

Abstract:Instructors are increasingly incorporating student-centered learning techniques in their classrooms to improve learning outcomes. In addition to lecture, these class sessions involve forms of individual and group work, and greater rates of student-instructor interaction. Quantifying classroom activity is a key element of accelerating the evaluation and refinement of innovative teaching practices, but manual annotation does not scale. In this manuscript, we present advances to the young application area of automatic classroom activity detection from audio. Using a university classroom corpus with nine activity labels (e.g., "lecture," "group work," "student question"), we propose and evaluate deep fully connected, convolutional, and recurrent neural network architectures, comparing the performance of mel-filterbank, OpenSmile, and self-supervised acoustic features. We compare 9-way classification performance with 5-way and 4-way simplifications of the task and assess two types of generalization: (1) new class sessions from previously seen instructors, and (2) previously unseen instructors. We obtain strong results on the new fine-grained task and state-of-the-art on the 4-way task: our best model obtains frame-level error rates of 6.2%, 7.7% and 28.0% when generalizing to unseen instructors for the 4-way, 5-way, and 9-way classification tasks, respectively (relative reductions of 35.4%, 48.3% and 21.6% over a strong baseline). When estimating the aggregate time spent on classroom activities, our average root mean squared error is 1.64 minutes per class session, a 54.9% relative reduction over the baseline.

* This work has been submitted to the IEEE for possible publication

Via

Access Paper or Ask Questions