Abstract:Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles--informativeness, uniqueness, and representativeness--for effective data selection. We argue that a valuable sample should be informative of the task, non-redundant, and represent the sample distribution (i.e., not an outlier). We further propose practical ways to score against each principle, which automatically adapts to a given dataset without tedious hyperparameter tuning. Comprehensive experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data, significantly reducing computational costs while maintaining superior results. This exemplifies the "Less is More" philosophy in MLLM development.
Abstract:Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of spatio-temporal compositionality in existing data and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve itself. Specifically, we first induce Spatio-Temporal Scene Graph (STSG) representation of diverse videos to capture fine-grained, multi-granular video semantics. Then, the STSGs guide the derivation of multi-step reasoning Question-Answer (QA) data with Chain-of-Thought (CoT) rationales. Both answers and rationales are integrated as training objective, aiming to enhance model's reasoning abilities by supervision over explicit reasoning steps. Experimental results demonstrate the effectiveness of STEP across models of varying scales, with a significant 21.3\% improvement in tasks requiring three or more reasoning steps. Furthermore, it achieves superior performance with a minimal amount of self-generated rationale-enriched training samples in both compositional reasoning and comprehensive understanding benchmarks, highlighting the broad applicability and vast potential.
Abstract:Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Using the dataset, we further train a novel AnyEdit Stable Diffusion with task-aware routing and learnable task embedding for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models. This presents prospects for developing instruction-driven image editing models that support human creativity.
Abstract:In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with complex scenarios requiring fine-grained semantic differentiation. This paper addresses these challenges by proposing a unified approach that integrates the strengths of both paradigms. Considering interleaved image-text sequences as the general format of input samples, we introduce a structure-induced training strategy that imposes semantic relationships between input samples and the MLLM's hidden state. This approach enhances the MLLM's ability to capture global semantics and distinguish fine-grained semantics. By leveraging dynamic sequence alignment within the Dynamic Time Warping framework and integrating a novel kernel for fine-grained semantic differentiation, our method effectively balances generative and discriminative tasks. Extensive experiments demonstrate the effectiveness of our approach, achieving state-of-the-art results in multiple generative tasks, especially those requiring cognitive and discrimination abilities. Additionally, our method surpasses discriminative benchmarks in interleaved and fine-grained retrieval tasks. By employing a retrieval-augmented generation strategy, our approach further enhances performance in some generative tasks within one model, offering a promising direction for future research in vision-language modeling.
Abstract:The swift advancement in Multimodal LLMs (MLLMs) also presents significant challenges for effective knowledge editing. Current methods, including intrinsic knowledge editing and external knowledge resorting, each possess strengths and weaknesses, struggling to balance the desired properties of reliability, generality, and locality when applied to MLLMs. In this paper, we propose UniKE, a novel multimodal editing method that establishes a unified perspective and paradigm for intrinsic knowledge editing and external knowledge resorting. Both types of knowledge are conceptualized as vectorized key-value memories, with the corresponding editing processes resembling the assimilation and accommodation phases of human cognition, conducted at the same semantic levels. Within such a unified framework, we further promote knowledge collaboration by disentangling the knowledge representations into the semantic and truthfulness spaces. Extensive experiments validate the effectiveness of our method, which ensures that the post-edit MLLM simultaneously maintains excellent reliability, generality, and locality. The code for UniKE will be available at \url{https://github.com/beepkh/UniKE}.
Abstract:The field of computer-aided synthesis planning (CASP) has seen rapid advancements in recent years, achieving significant progress across various algorithmic benchmarks. However, chemists often encounter numerous infeasible reactions when using CASP in practice. This article delves into common errors associated with CASP and introduces a product prediction model aimed at enhancing the accuracy of single-step models. While the product prediction model reduces the number of single-step reactions, it integrates multiple single-step models to maintain the overall reaction count and increase reaction diversity. Based on manual analysis and large-scale testing, the product prediction model, combined with the multi-model ensemble approach, has been proven to offer higher feasibility and greater diversity.
Abstract:Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks. However, the hallucinations inherent in machine-generated data, which could lead to hallucinatory outputs in MLLMs, remain under-explored. This work aims to investigate various hallucinations (i.e., object, relation, attribute hallucinations) and mitigate those hallucinatory toxicities in large-scale machine-generated visual instruction datasets. Drawing on the human ability to identify factual errors, we present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm. We use our framework to identify and eliminate hallucinations in the training data automatically. Interestingly, HalluciDoctor also indicates that spurious correlations arising from long-tail object co-occurrences contribute to hallucinations. Based on that, we execute counterfactual visual instruction expansion to balance data distribution, thereby enhancing MLLMs' resistance to hallucinations. Comprehensive experiments on hallucination evaluation benchmarks show that our method successfully mitigates 44.6% hallucinations relatively and maintains competitive performance compared to LLaVA.The source code will be released at \url{https://github.com/Yuqifan1117/HalluciDoctor}.
Abstract:The rising demand for creating lifelike avatars in the digital realm has led to an increased need for generating high-quality human videos guided by textual descriptions and poses. We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues. Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion. The crux of innovation lies in our adept utilization of the T2I diffusion model for producing video frames successively while preserving contextual relevance. We surmount the hurdles posed by maintaining human character and clothing consistency across varying poses, along with upholding the background's continuity amidst diverse human movements. To ensure consistent human appearances across the entire video, we devise an intra-frame alignment module. This module assimilates text-guided synthesized human character knowledge into the pretrained T2I diffusion model, synergizing insights from ChatGPT. For preserving background continuity, we put forth a background alignment pipeline, amalgamating insights from segment anything and image inpainting techniques. Furthermore, we propose an inter-frame alignment module that draws inspiration from an auto-regressive pipeline to augment temporal consistency between adjacent frames, where the preceding frame guides the synthesis process of the current frame. Comparisons with state-of-the-art methods demonstrate that Dancing Avatar exhibits the capacity to generate human videos with markedly superior quality, both in terms of human and background fidelity, as well as temporal coherence compared to existing state-of-the-art approaches.
Abstract:Recent text-to-image generation models have shown promising results in generating high-fidelity photo-realistic images. In parallel, the problem of data scarcity has brought a growing interest in employing AIGC technology for high-quality data expansion. However, this paradigm requires well-designed prompt engineering that cost-less data expansion and labeling remain under-explored. Inspired by LLM's powerful capability in task guidance, we propose a new paradigm of annotated data expansion named as ChatGenImage. The core idea behind it is to leverage the complementary strengths of diverse models to establish a highly effective and user-friendly pipeline for interactive data augmentation. In this work, we extensively study how LLMs communicate with AIGC model to achieve more controllable image generation and make the first attempt to collaborate them for automatic data augmentation for a variety of downstream tasks. Finally, we present fascinating results obtained from our ChatGenImage framework and demonstrate the powerful potential of our synthetic data for systematic vision adaptation. Our codes are available at https://github.com/Yuqifan1117/Labal-Anything-Pipeline.
Abstract:Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer long-tail distribution issues that tail-predicates are more costly to train and hard to distinguish due to a small amount of annotated data compared to frequent predicates. Existing re-balancing strategies try to haddle it via prior rules but are still confined to pre-defined conditions, which are not scalable for various models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthen existing SGG to tackle the long-tailed problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction.