Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuta Nakashima

PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning

Sep 26, 2025

Jiahao Zhang, Bowen Wang, Hong Liu, Yuta Nakashima, Hajime Nagahara

Abstract:Visual In-Context Learning (VICL) uses input-output image pairs, referred to as in-context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. However, VICL often suffers from over-reliance on a single in-context pair, which can lead to biased and unstable predictions. We introduce PAtch-based $k$-Nearest neighbor visual In-Context Learning (PANICL), a general training-free framework that mitigates this issue by leveraging multiple in-context pairs. PANICL smooths assignment scores across pairs, reducing bias without requiring additional training. Extensive experiments on a variety of tasks, including foreground segmentation, single object detection, colorization, multi-object segmentation, and keypoint detection, demonstrate consistent improvements over strong baselines. Moreover, PANICL exhibits strong robustness to domain shifts, including dataset-level shift (e.g., from COCO to Pascal) and label-space shift (e.g., FSS-1000), and generalizes well to other VICL models such as SegGPT, Painter, and LVM, highlighting its versatility and broad applicability.

* 21 pages, 12 figures

Via

Access Paper or Ask Questions

Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Sep 09, 2025

Yusuke Hirota, Ryo Hachiuma, Boyi Li, Ximing Lu, Michael Ross Boone, Boris Ivanovic, Yejin Choi, Marco Pavone, Yu-Chiang Frank Wang, Noa Garcia(+2 more)

Figure 1 for Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Figure 2 for Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Figure 3 for Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Figure 4 for Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Abstract:Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.

* ICCV 2025

Via

Access Paper or Ask Questions

From Global to Local: Social Bias Transfer in CLIP

Aug 25, 2025

Ryan Ramos, Yusuke Hirota, Yuta Nakashima, Noa Garcia

Abstract:The recycling of contrastive language-image pre-trained (CLIP) models as backbones for a large number of downstream tasks calls for a thorough analysis of their transferability implications, especially their well-documented reproduction of social biases and human stereotypes. How do such biases, learned during pre-training, propagate to downstream applications like visual question answering or image captioning? Do they transfer at all? We investigate this phenomenon, referred to as bias transfer in prior literature, through a comprehensive empirical analysis. Firstly, we examine how pre-training bias varies between global and local views of data, finding that bias measurement is highly dependent on the subset of data on which it is computed. Secondly, we analyze correlations between biases in the pre-trained models and the downstream tasks across varying levels of pre-training bias, finding difficulty in discovering consistent trends in bias transfer. Finally, we explore why this inconsistency occurs, showing that under the current paradigm, representation spaces of different pre-trained CLIPs tend to converge when adapted for downstream tasks. We hope this work offers valuable insights into bias behavior and informs future research to promote better bias mitigation practices.

Via

Access Paper or Ask Questions

QuMAB: Query-based Multi-annotator Behavior Pattern Learning

Jul 23, 2025

Liyun Zhang, Zheng Lian, Hong Liu, Takanori Takebe, Yuta Nakashima

Abstract:Multi-annotator learning traditionally aggregates diverse annotations to approximate a single ground truth, treating disagreements as noise. However, this paradigm faces fundamental challenges: subjective tasks often lack absolute ground truth, and sparse annotation coverage makes aggregation statistically unreliable. We introduce a paradigm shift from sample-wise aggregation to annotator-wise behavior modeling. By treating annotator disagreements as valuable information rather than noise, modeling annotator-specific behavior patterns can reconstruct unlabeled data to reduce annotation cost, enhance aggregation reliability, and explain annotator decision behavior. To this end, we propose QuMATL (Query-based Multi-Annotator Behavior Pattern Learning), which uses light-weight queries to model individual annotators while capturing inter-annotator correlations as implicit regularization, preventing overfitting to sparse individual data while maintaining individualization and improving generalization, with a visualization of annotator focus regions offering an explainable analysis of behavior understanding. We contribute two large-scale datasets with dense per-annotator labels: STREET (4,300 labels/annotator) and AMER (average 3,118 labels/annotator), the first multimodal multi-annotator dataset.

* 12 pages. arXiv admin note: substantial text overlap with arXiv:2503.15237

Via

Access Paper or Ask Questions

E-InMeMo: Enhanced Prompting for Visual In-Context Learning

Apr 25, 2025

Jiahao Zhang, Bowen Wang, Hong Liu, Liangzhi Li, Yuta Nakashima, Hajime Nagahara

Abstract:Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo

* Preprint

Via

Access Paper or Ask Questions

Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization

Mar 12, 2025

Zongshang Pang, Mayu Otani, Yuta Nakashima

Figure 1 for Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization

Figure 2 for Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization

Figure 3 for Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization

Figure 4 for Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization

Abstract:Localizing user-queried events through natural language is crucial for video understanding models. Recent methods predominantly adapt Video LLMs to generate event boundary timestamps to handle temporal localization tasks, which struggle to leverage LLMs' powerful semantic understanding. In this work, we introduce MeCo, a novel timestamp-free framework that enables video LLMs to fully harness their intrinsic semantic capabilities for temporal localization tasks. Rather than outputting boundary timestamps, MeCo partitions videos into holistic event and transition segments based on the proposed structural token generation and grounding pipeline, derived from video LLMs' temporal structure understanding capability. We further propose a query-focused captioning task that compels the LLM to extract fine-grained, event-specific details, bridging the gap between localization and higher-level semantics and enhancing localization performance. Extensive experiments on diverse temporal localization tasks show that MeCo consistently outperforms boundary-centric methods, underscoring the benefits of a semantic-driven approach for temporal localization with video LLMs.

Via

Access Paper or Ask Questions

No Annotations for Object Detection in Art through Stable Diffusion

Dec 09, 2024

Patrick Ramos, Nicolas Gonthier, Selina Khan, Yuta Nakashima, Noa Garcia

Figure 1 for No Annotations for Object Detection in Art through Stable Diffusion

Figure 2 for No Annotations for Object Detection in Art through Stable Diffusion

Figure 3 for No Annotations for Object Detection in Art through Stable Diffusion

Figure 4 for No Annotations for Object Detection in Art through Stable Diffusion

Abstract:Object detection in art is a valuable tool for the digital humanities, as it allows for faster identification of objects in artistic and historical images compared to humans. However, annotating such images poses significant challenges due to the need for specialized domain expertise. We present NADA (no annotations for detection in art), a pipeline that leverages diffusion models' art-related knowledge for object detection in paintings without the need for full bounding box supervision. Our method, which supports both weakly-supervised and zero-shot scenarios and does not require any fine-tuning of its pretrained components, consists of a class proposer based on large vision-language models and a class-conditioned detector based on Stable Diffusion. NADA is evaluated on two artwork datasets, ArtDL 2.0 and IconArt, outperforming prior work in weakly-supervised detection, while being the first work for zero-shot object detection in art. Code is available at https://github.com/patrick-john-ramos/nada

* 8 pages, 6 figures, to be published in WACV 2025

Via

Access Paper or Ask Questions

VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction

Dec 05, 2024

Jiahao Zhang, Ryota Yoshihashi, Shunsuke Kitada, Atsuki Osanai, Yuta Nakashima

Figure 1 for VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction

Figure 2 for VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction

Figure 3 for VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction

Figure 4 for VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction

Abstract:Large language models (LLMs) have proven effective for layout generation due to their ability to produce structure-description languages, such as HTML or JSON, even without access to visual information. Recently, LLM providers have evolved these models into large vision-language models (LVLM), which shows prominent multi-modal understanding capabilities. Then, how can we leverage this multi-modal power for layout generation? To answer this, we propose Visual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based content-aware layout generation. In our method, LVLMs iteratively refine their outputs with reference to rendered layout images, which are visualized as colored bounding boxes on poster backgrounds. In experiments, we demonstrate that our method combined with the Gemini. Without any additional training, VASCAR achieves state-of-the-art (SOTA) layout generation quality outperforming both existing layout-specific generative models and other LLM-based methods.

Via

Access Paper or Ask Questions

ReLayout: Towards Real-World Document Understanding via Layout-enhanced Pre-training

Oct 14, 2024

Zhouqiang Jiang, Bowen Wang, Junhao Chen, Yuta Nakashima

Abstract:Recent approaches for visually-rich document understanding (VrDU) uses manually annotated semantic groups, where a semantic group encompasses all semantically relevant but not obviously grouped words. As OCR tools are unable to automatically identify such grouping, we argue that current VrDU approaches are unrealistic. We thus introduce a new variant of the VrDU task, real-world visually-rich document understanding (ReVrDU), that does not allow for using manually annotated semantic groups. We also propose a new method, ReLayout, compliant with the ReVrDU scenario, which learns to capture semantic grouping through arranging words and bringing the representations of words that belong to the potential same semantic group closer together. Our experimental results demonstrate the performance of existing methods is deteriorated with the ReVrDU task, while ReLayout shows superiour performance.

Via

Access Paper or Ask Questions

Putting People in LLMs' Shoes: Generating Better Answers via Question Rewriter

Aug 20, 2024

Junhao Chen, Bowen Wang, Zhouqiang jiang, Yuta Nakashima

Figure 1 for Putting People in LLMs' Shoes: Generating Better Answers via Question Rewriter

Figure 2 for Putting People in LLMs' Shoes: Generating Better Answers via Question Rewriter

Figure 3 for Putting People in LLMs' Shoes: Generating Better Answers via Question Rewriter

Figure 4 for Putting People in LLMs' Shoes: Generating Better Answers via Question Rewriter

Abstract:Large Language Models (LLMs) have demonstrated significant capabilities, particularly in the domain of question answering (QA). However, their effectiveness in QA is often undermined by the vagueness of user questions. To address this issue, we introduce single-round instance-level prompt optimization, referred to as question rewriter. By enhancing the intelligibility of human questions for black-box LLMs, our question rewriter improves the quality of generated answers. The rewriter is optimized using direct preference optimization based on feedback collected from automatic criteria for evaluating generated answers; therefore, its training does not require costly human annotations. The experiments across multiple black-box LLMs and long-form question answering (LFQA) datasets demonstrate the efficacy of our method. This paper provides a practical framework for training question rewriters and sets a precedent for future explorations in prompt optimization within LFQA tasks. Code is available at \url{https://github.com/3244we/Question-Rewriter}.

* 7 pages, 4 figures, 5 tables

Via

Access Paper or Ask Questions