Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yingjun Du

QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain

Nov 29, 2024

Wenfang Sun, Yingjun Du, Gaowen Liu, Cees G. M. Snoek

Abstract:We tackle the problem of quantifying the number of objects by a generative text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining. It leverages a dual-loop meta-learning strategy to optimize a domain-invariant prompt. Further, by integrating prompt learning with learnable counting and domain tokens, our method captures stylistic variations and maintains accuracy, even for object classes not encountered during training. For evaluation, we adopt a new benchmark specifically designed for object quantification in domain generalization, enabling rigorous assessment of object quantification accuracy and adaptability across unseen domains in text-to-image generation. Extensive experiments demonstrate that QUOTA outperforms conventional models in both object quantification accuracy and semantic consistency, setting a new benchmark for efficient and scalable text-to-image generation for any domain.

* 12 pages, 6 figures

Via

Access Paper or Ask Questions

CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

Nov 07, 2024

Jie Liu, Pan Zhou, Yingjun Du, Ah-Hwee Tan, Cees G. M. Snoek, Jan-Jakob Sonke, Efstratios Gavves

Figure 1 for CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

Figure 2 for CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

Figure 3 for CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

Figure 4 for CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

Abstract:In this work, we address the cooperation problem among large language model (LLM) based embodied agents, where agents must cooperate to achieve a common goal. Previous methods often execute actions extemporaneously and incoherently, without long-term strategic and cooperative planning, leading to redundant steps, failures, and even serious repercussions in complex tasks like search-and-rescue missions where discussion and cooperative plan are crucial. To solve this issue, we propose Cooperative Plan Optimization (CaPo) to enhance the cooperation efficiency of LLM-based embodied agents. Inspired by human cooperation schemes, CaPo improves cooperation efficiency with two phases: 1) meta-plan generation, and 2) progress-adaptive meta-plan and execution. In the first phase, all agents analyze the task, discuss, and cooperatively create a meta-plan that decomposes the task into subtasks with detailed steps, ensuring a long-term strategic and coherent plan for efficient coordination. In the second phase, agents execute tasks according to the meta-plan and dynamically adjust it based on their latest progress (e.g., discovering a target object) through multi-turn discussions. This progress-based adaptation eliminates redundant actions, improving the overall cooperation efficiency of agents. Experimental results on the ThreeDworld Multi-Agent Transport and Communicative Watch-And-Help tasks demonstrate that CaPo achieves much higher task completion rate and efficiency compared with state-of-the-arts.

* Under review

Via

Access Paper or Ask Questions

Prompt Diffusion Robustifies Any-Modality Prompt Learning

Oct 26, 2024

Yingjun Du, Gaowen Liu, Yuzhang Shang, Yuguang Yao, Ramana Kompella, Cees G. M. Snoek

Figure 1 for Prompt Diffusion Robustifies Any-Modality Prompt Learning

Figure 2 for Prompt Diffusion Robustifies Any-Modality Prompt Learning

Figure 3 for Prompt Diffusion Robustifies Any-Modality Prompt Learning

Figure 4 for Prompt Diffusion Robustifies Any-Modality Prompt Learning

Abstract:Foundation models enable prompt-based classifiers for zero-shot and few-shot learning. Nonetheless, the conventional method of employing fixed prompts suffers from distributional shifts that negatively impact generalizability to unseen samples. This paper introduces prompt diffusion, which uses a diffusion model to gradually refine the prompts to obtain a customized prompt for each sample. Specifically, we first optimize a collection of prompts to obtain over-fitted prompts per sample. Then, we propose a prompt diffusion model within the prompt space, enabling the training of a generative transition process from a random prompt to its overfitted prompt. As we cannot access the label of a test image during inference, our model gradually generates customized prompts solely from random prompts using our trained, prompt diffusion. Our prompt diffusion is generic, flexible, and modality-agnostic, making it a simple plug-and-play module seamlessly embedded into existing prompt learning methods for textual, visual, or multi-modal prompt learning. Our diffusion model uses a fast ODE-based sampling strategy to optimize test sample prompts in just five steps, offering a good trade-off between performance improvement and computational efficiency. For all prompt learning methods tested, adding prompt diffusion yields more robust results for base-to-new generalization, cross-dataset generalization, and domain generalization in classification tasks tested over 15 diverse datasets.

* Under review

Via

Access Paper or Ask Questions

IPO: Interpretable Prompt Optimization for Vision-Language Models

Oct 20, 2024

Yingjun Du, Wenfang Sun, Cees G. M. Snoek

Figure 1 for IPO: Interpretable Prompt Optimization for Vision-Language Models

Figure 2 for IPO: Interpretable Prompt Optimization for Vision-Language Models

Figure 3 for IPO: Interpretable Prompt Optimization for Vision-Language Models

Figure 4 for IPO: Interpretable Prompt Optimization for Vision-Language Models

Abstract:Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineering. Instead, current approaches to prompt optimization learn the prompts through gradient descent, where the prompts are treated as adjustable parameters. However, these methods tend to lead to overfitting of the base classes seen during training and produce prompts that are no longer understandable by humans. This paper introduces a simple but interpretable prompt optimizer (IPO), that utilizes large language models (LLMs) to generate textual prompts dynamically. We introduce a Prompt Optimization Prompt that not only guides LLMs in creating effective prompts but also stores past prompts with their performance metrics, providing rich in-context information. Additionally, we incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions, which enhance the interaction between textual and visual modalities. This allows for thae creation of dataset-specific prompts that improve generalization performance, while maintaining human comprehension. Extensive testing across 11 datasets reveals that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably enhances the interpretability of the generated prompts. By leveraging the strengths of LLMs, our approach ensures that the prompts remain human-understandable, thereby facilitating better transparency and oversight for vision-language models.

* Accepted by NeurIPS 2024

Via

Access Paper or Ask Questions

Training-Free Semantic Segmentation via LLM-Supervision

Mar 31, 2024

Wenfang Sun, Yingjun Du, Gaowen Liu, Ramana Kompella, Cees G. M. Snoek

Figure 1 for Training-Free Semantic Segmentation via LLM-Supervision

Figure 2 for Training-Free Semantic Segmentation via LLM-Supervision

Figure 3 for Training-Free Semantic Segmentation via LLM-Supervision

Figure 4 for Training-Free Semantic Segmentation via LLM-Supervision

Abstract:Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model accuracy through prompt engineering, prompt learning, or fine-tuning with limited labeled data, thereby overlooking the importance of refining the class descriptors. This paper introduces a new approach to text-supervised semantic segmentation using supervision by a large language model (LLM) that does not require extra training. Our method starts from an LLM, like GPT-3, to generate a detailed set of subclasses for more accurate class representation. We then employ an advanced text-supervised semantic segmentation model to apply the generated subclasses as target labels, resulting in diverse segmentation results tailored to each subclass's unique characteristics. Additionally, we propose an assembly that merges the segmentation maps from the various subclass descriptors to ensure a more comprehensive representation of the different aspects in the test images. Through comprehensive experiments on three standard benchmarks, our method outperforms traditional text-supervised semantic segmentation methods by a marked margin.

* 22 pages,10 figures, conference

Via

Access Paper or Ask Questions

EMO: Episodic Memory Optimization for Few-Shot Meta-Learning

Jun 26, 2023

Yingjun Du, Jiayi Shen, Xiantong Zhen, Cees G. M. Snoek

Figure 1 for EMO: Episodic Memory Optimization for Few-Shot Meta-Learning

Figure 2 for EMO: Episodic Memory Optimization for Few-Shot Meta-Learning

Figure 3 for EMO: Episodic Memory Optimization for Few-Shot Meta-Learning

Figure 4 for EMO: Episodic Memory Optimization for Few-Shot Meta-Learning

Abstract:Few-shot meta-learning presents a challenge for gradient descent optimization due to the limited number of training samples per task. To address this issue, we propose an episodic memory optimization for meta-learning, we call EMO, which is inspired by the human ability to recall past learning experiences from the brain's memory. EMO retains the gradient history of past experienced tasks in external memory, enabling few-shot learning in a memory-augmented way. By learning to retain and recall the learning process of past training tasks, EMO nudges parameter updates in the right direction, even when the gradients provided by a limited number of examples are uninformative. We prove theoretically that our algorithm converges for smooth, strongly convex objectives. EMO is generic, flexible, and model-agnostic, making it a simple plug-and-play optimizer that can be seamlessly embedded into existing optimization-based few-shot meta-learning approaches. Empirical results show that EMO scales well with most few-shot classification benchmarks and improves the performance of optimization-based meta-learning methods, resulting in accelerated convergence.

* Accepted by CoLLAs 2023

Via

Access Paper or Ask Questions

ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion

Jun 26, 2023

Yingjun Du, Zehao Xiao, Shengcai Liao, Cees Snoek

Figure 1 for ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion

Figure 2 for ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion

Figure 3 for ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion

Figure 4 for ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion

Abstract:Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, enabling accurately obtaining the overfitted prototypes for individual tasks. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. Furthermore, to expedite training and enhance ProtoDiff's performance, we propose the utilization of residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff.

* Under review

Via

Access Paper or Ask Questions

Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation

Jun 16, 2023

Shuo Chen, Yingjun Du, Pascal Mettes, Cees G. M. Snoek

Figure 1 for Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation

Figure 2 for Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation

Figure 3 for Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation

Figure 4 for Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation

Abstract:This paper investigates the problem of scene graph generation in videos with the aim of capturing semantic relations between subjects and objects in the form of $\langle$subject, predicate, object$\rangle$ triplets. Recognizing the predicate between subject and object pairs is imbalanced and multi-label in nature, ranging from ubiquitous interactions such as spatial relationships (\eg \emph{in front of}) to rare interactions such as \emph{twisting}. In widely-used benchmarks such as Action Genome and VidOR, the imbalance ratio between the most and least frequent predicates reaches 3,218 and 3,408, respectively, surpassing even benchmarks specifically designed for long-tailed recognition. Due to the long-tailed distributions and label co-occurrences, recent state-of-the-art methods predominantly focus on the most frequently occurring predicate classes, ignoring those in the long tail. In this paper, we analyze the limitations of current approaches for scene graph generation in videos and identify a one-to-one correspondence between predicate frequency and recall performance. To make the step towards unbiased scene graph generation in videos, we introduce a multi-label meta-learning framework to deal with the biased predicate distribution. Our meta-learning framework learns a meta-weight network for each training sample over all possible label losses. We evaluate our approach on the Action Genome and VidOR benchmarks by building upon two current state-of-the-art methods for each benchmark. The experiments demonstrate that the multi-label meta-weight network improves the performance for predicates in the long tail without compromising performance for head classes, resulting in better overall performance and favorable generalizability. Code: \url{https://github.com/shanshuo/ML-MWN}.

* ICMR 2023

Via

Access Paper or Ask Questions

MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks

May 17, 2023

Wenfang Sun, Yingjun Du, Xiantong Zhen, Fan Wang, Ling Wang, Cees G. M. Snoek

Figure 1 for MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks

Figure 2 for MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks

Figure 3 for MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks

Figure 4 for MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks

Abstract:Meta-learning algorithms are able to learn a new task using previously learned knowledge, but they often require a large number of meta-training tasks which may not be readily available. To address this issue, we propose a method for few-shot learning with fewer tasks, which we call MetaModulation. The key idea is to use a neural network to increase the density of the meta-training tasks by modulating batch normalization parameters during meta-training. Additionally, we modify parameters at various network levels, rather than just a single layer, to increase task diversity. To account for the uncertainty caused by the limited training tasks, we propose a variational MetaModulation where the modulation parameters are treated as latent variables. We also introduce learning variational feature hierarchies by the variational MetaModulation, which modulates features at all layers and can consider task uncertainty and generate more diverse tasks. The ablation studies illustrate the advantages of utilizing a learnable task modulation at different levels and demonstrate the benefit of incorporating probabilistic variants in few-task meta-learning. Our MetaModulation and its variational variants consistently outperform state-of-the-art alternatives on four few-task meta-learning benchmarks.

* Accepted by ICML 2023

Via

Access Paper or Ask Questions

SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail

Mar 31, 2023

Yingjun Du, Jiayi Shen, Xiantong Zhen, Cees G. M. Snoek

Abstract:Modern image classifiers perform well on populated classes, while degrading considerably on tail classes with only a few instances. Humans, by contrast, effortlessly handle the long-tailed recognition challenge, since they can learn the tail representation based on different levels of semantic abstraction, making the learned tail features more discriminative. This phenomenon motivated us to propose SuperDisco, an algorithm that discovers super-class representations for long-tailed recognition using a graph model. We learn to construct the super-class graph to guide the representation learning to deal with long-tailed distributions. Through message passing on the super-class graph, image representations are rectified and refined by attending to the most relevant entities based on the semantic similarity among their super-classes. Moreover, we propose to meta-learn the super-class graph under the supervision of a prototype graph constructed from a small amount of imbalanced data. By doing so, we obtain a more robust super-class graph that further improves the long-tailed recognition performance. The consistent state-of-the-art experiments on the long-tailed CIFAR-100, ImageNet, Places and iNaturalist demonstrate the benefit of the discovered super-class graph for dealing with long-tailed distributions.

* Accepted by CVPR 2023

Via

Access Paper or Ask Questions