Nathalie Rauschmayr

Test-Time Visual In-Context Tuning

Mar 27, 2025

Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele

Abstract: Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we swap the roles of the task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that VICT can improve the generalizability of VICL to unseen domains. In addition, we show the potential of applying VICT to unseen tasks at test time. Code: https://github.com/Jiahao000/VICT.

* CVPR 2025. Code: https://github.com/Jiahao000/VICT

I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning

Feb 26, 2025

Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari

Abstract: Large-scale machine learning models deliver strong performance across a wide range of tasks but come with significant computational and resource constraints. To mitigate these challenges, smaller local models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of these models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work, we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model. Moreover, it incorporates a mechanism for managing the trade-off between model performance and deferral accuracy, and is broadly applicable across various tasks and domains without any architectural changes. We evaluate our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.
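
The deferral mechanism the abstract refers to can be illustrated with a minimal sketch. The abstract does not specify the Gatekeeper loss itself, so the code below only shows a common confidence-thresholded deferral rule (max softmax probability) that such a cascade might use at inference time. The function names, the `threshold` parameter, and the stand-in `large_model` callable are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(small_logits, large_model, x, threshold=0.8):
    """Hypothetical deferral rule: keep the small model's answer when its
    top softmax probability reaches `threshold`, otherwise call the large model.

    Returns (predicted_label, deferred_to_large_model).
    """
    probs = softmax(small_logits)
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), False
    return large_model(x), True


# Stand-in for an expensive large model (assumption: always predicts class 2).
large = lambda x: 2

confident = cascade_predict([5.0, 0.0, 0.0], large, x=None)
uncertain = cascade_predict([0.1, 0.0, 0.0], large, x=None)
```

Under this rule, a small model that is well calibrated (confident only when correct) defers less often without losing accuracy, which is the property the abstract says the fine-tuning targets.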

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

Oct 15, 2024

Shangbin Feng, Zifeng Wang, Yike Wang, Sayna Ebrahimi, Hamid Palangi, Lesly Miculicich, Achin Kulshrestha, Nathalie Rauschmayr, Yejin Choi, Yulia Tsvetkov (+2 more)

Abstract: We propose Model Swarms, a collaborative search algorithm to adapt LLMs via swarm intelligence, the collective behavior guiding individual systems. Specifically, Model Swarms starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and optimize a utility function representing model adaptation objectives. Compared to existing model composition approaches, Model Swarms offers tuning-free model adaptation, works in low-data regimes with as few as 200 examples, and does not require assumptions about specific experts in the swarm or how they should be composed. Extensive experiments demonstrate that Model Swarms can flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that Model Swarms enables the weak-to-strong transition of experts through the collaborative search process.
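
The idea of experts moving through weight space toward the best-found checkpoints resembles classic particle swarm optimization. The sketch below is a textbook PSO loop over small weight vectors, not the paper's exact update rule, which the abstract does not give. The function name `swarm_search`, the coefficients `inertia`, `c_personal`, and `c_global`, and the toy quadratic utility are assumptions chosen for illustration.

```python
import random


def swarm_search(experts, utility, steps=50, inertia=0.5,
                 c_personal=1.0, c_global=1.0, seed=0):
    """Textbook particle swarm search (illustrative, not the paper's rule).

    experts: list of weight vectors (lists of floats), one per expert.
    utility: function mapping a weight vector to a score to maximize.
    Returns (best_weights, best_score) found across the whole swarm.
    """
    rng = random.Random(seed)
    positions = [list(e) for e in experts]
    velocities = [[0.0] * len(e) for e in experts]
    personal_best = [list(p) for p in positions]
    personal_score = [utility(p) for p in positions]
    best_idx = max(range(len(positions)), key=lambda i: personal_score[i])
    global_best = list(personal_best[best_idx])
    global_score = personal_score[best_idx]

    for _ in range(steps):
        for i, pos in enumerate(positions):
            for d in range(len(pos)):
                r1, r2 = rng.random(), rng.random()
                # Pull toward this expert's own best and the swarm-wide best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - pos[d])
                    + c_global * r2 * (global_best[d] - pos[d])
                )
                pos[d] += velocities[i][d]
            score = utility(pos)
            if score > personal_score[i]:
                personal_score[i] = score
                personal_best[i] = list(pos)
            if score > global_score:
                global_score = score
                global_best = list(pos)
    return global_best, global_score


# Toy utility (assumption): higher is better, maximized at weights [1.0, 1.0].
toy_utility = lambda w: -sum((x - 1.0) ** 2 for x in w)
initial_experts = [[0.0, 0.0], [2.0, 3.0], [-1.0, 1.0]]
best_weights, best_score = swarm_search(initial_experts, toy_utility)
```

The search needs only utility evaluations, no gradients, which is consistent with the abstract's description of the approach as tuning-free. In a real setting each position would be a full set of model weights and the utility would be measured on a small validation set.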

Extracting Training Data from Document-Based VQA Models

Jul 11, 2024

Francesco Pinto, Nathalie Rauschmayr, Florian Tramèr, Philip Torr, Federico Tombari

Abstract: Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image). In this work, we show these models can memorize responses for training samples and regurgitate them even when the relevant visual information has been removed. This includes Personally Identifiable Information (PII) repeated once in the training set, indicating these models could divulge memorized sensitive information and therefore pose a privacy risk. We quantitatively measure the extractability of information in controlled experiments and differentiate between cases where it arises from generalization capabilities or from memorization. We further investigate the factors that influence memorization across multiple state-of-the-art models and propose an effective heuristic countermeasure that empirically prevents the extractability of PII.

* ICML 2024

Defuse: Harnessing Unrestricted Adversarial Examples for Debugging Models Beyond Test Accuracy

Feb 11, 2021

Dylan Slack, Nathalie Rauschmayr, Krishnaram Kenthapadi

Abstract: We typically compute aggregate statistics on held-out test data to assess the generalization of machine learning models. However, statistics on test data often overstate model generalization, and thus, the performance of deployed machine learning models can be variable and untrustworthy. Motivated by these concerns, we develop methods to automatically discover and correct model errors beyond those available in the data. We propose Defuse, a method that generates novel model misclassifications, categorizes these errors into high-level model bugs, and efficiently labels and fine-tunes on the errors to correct them. To generate misclassified data, we propose an algorithm inspired by adversarial machine learning techniques that uses a generative model to find naturally occurring instances misclassified by a model. Further, we observe that the generative models have regions in their latent space with higher concentrations of misclassifications. We call these regions misclassification regions and find they have several useful properties. Each region contains a specific type of model bug; for instance, a misclassification region for an MNIST classifier contains a style of skinny 6 that the model mistakes as a 1. We can also assign a single label to each region, facilitating low-cost labeling. We propose a method to learn the misclassification regions and use this insight to both categorize errors and correct them. In practice, Defuse finds and corrects novel errors in classifiers. For example, Defuse shows that a high-performance traffic sign classifier mistakes certain 50 km/h signs as 80 km/h. Defuse corrects the error after fine-tuning while maintaining generalization on the test set.
