Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Changyou Chen

ANU & NICTA

A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation

Dec 20, 2024

Shijie Zhou, Ruiyi Zhang, Yufan Zhou, Changyou Chen

Abstract:Large multimodal models still struggle with text-rich images because of inadequate training data. Self-Instruct provides an annotation-free way for generating instruction data, but its quality is poor, as multimodal alignment remains a hurdle even for the largest models. In this work, we propose LLaVAR-2, to enhance multimodal alignment for text-rich images through hybrid instruction generation between human annotators and large language models. Specifically, it involves detailed image captions from human annotators, followed by the use of these annotations in tailored text prompts for GPT-4o to curate a dataset. It also implements several mechanisms to filter out low-quality data, and the resulting dataset comprises 424k high-quality pairs of instructions. Empirical results show that models fine-tuned on this dataset exhibit impressive enhancements over those trained with self-instruct data.

* COLING 2025

Via

Access Paper or Ask Questions

Enhancing Diffusion Posterior Sampling for Inverse Problems by Integrating Crafted Measurements

Nov 15, 2024

Shijie Zhou, Huaisheng Zhu, Rohan Sharma, Ruiyi Zhang, Kaiyi Ji, Changyou Chen

Abstract:Diffusion models have emerged as a powerful foundation model for visual generation. With an appropriate sampling process, it can effectively serve as a generative prior to solve general inverse problems. Current posterior sampling based methods take the measurement (i.e., degraded image sample) into the posterior sampling to infer the distribution of the target data (i.e., clean image sample). However, in this manner, we show that high-frequency information can be prematurely introduced during the early stages, which could induce larger posterior estimate errors during the restoration sampling. To address this issue, we first reveal that forming the log posterior gradient with the noisy measurement ( i.e., samples from a diffusion forward process) instead of the clean one can benefit the reverse process. Consequently, we propose a novel diffusion posterior sampling method DPS-CM, which incorporates a Crafted Measurement (i.e., samples generated by a reverse denoising process, compared to random sampling with noise in standard methods) to form the posterior estimate. This integration aims to mitigate the misalignment with the diffusion prior caused by cumulative posterior estimate errors. Experimental results demonstrate that our approach significantly improves the overall capacity to solve general and noisy inverse problems, such as Gaussian deblurring, super-resolution, inpainting, nonlinear deblurring, and tasks with Poisson noise, relative to existing approaches.

Via

Access Paper or Ask Questions

LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding

Nov 02, 2024

Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A. Rossi, Changyou Chen, Tong Sun

Abstract:Large multimodal models (LMMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents. Traditional methods using document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to LMMs leads to inefficiencies, especially with lengthy documents. In this work, we present a novel framework named LoRA-Contextualizing Adaptation of Large multimodal models (LoCAL), which broadens the capabilities of any LMM to support long-document understanding. We demonstrate that LMMs can effectively serve as multimodal retrievers, fetching relevant pages to answer user questions based on these pages. LoCAL is implemented with two specific LMM adapters: one for evidence page retrieval and another for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of LoCAL.

* Currently Under Review

Via

Access Paper or Ask Questions

TextLap: Customizing Language Models for Text-to-Layout Planning

Oct 09, 2024

Jian Chen, Ruiyi Zhang, Yufan Zhou, Jennifer Healey, Jiuxiang Gu, Zhiqiang Xu, Changyou Chen

Abstract:Automatic generation of graphical layouts is crucial for many real-world applications, including designing posters, flyers, advertisements, and graphical user interfaces. Given the incredible ability of Large language models (LLMs) in both natural language understanding and generation, we believe that we could customize an LLM to help people create compelling graphical layouts starting with only text instructions from the user. We call our method TextLap (text-based layout planning). It uses a curated instruction-based layout planning dataset (InsLap) to customize LLMs as a graphic designer. We demonstrate the effectiveness of TextLap and show that it outperforms strong baselines, including GPT-4 based methods, for image generation and graphical design benchmarks.

* Accepted to the EMNLP Findings

Via

Access Paper or Ask Questions

MMR: Evaluating Reading Ability of Large Multimodal Models

Aug 26, 2024

Jian Chen, Ruiyi Zhang, Yufan Zhou, Ryan Rossi, Jiuxiang Gu, Changyou Chen

Figure 1 for MMR: Evaluating Reading Ability of Large Multimodal Models

Figure 2 for MMR: Evaluating Reading Ability of Large Multimodal Models

Figure 3 for MMR: Evaluating Reading Ability of Large Multimodal Models

Figure 4 for MMR: Evaluating Reading Ability of Large Multimodal Models

Abstract:Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of image, including text-rich images. Most existing text-rich image benchmarks are simple extraction-based question answering, and many LMMs now easily achieve high scores. This means that current benchmarks fail to accurately reflect performance of different models, and a natural idea is to build a new benchmark to evaluate their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark in 11 diverse tasks to evaluate LMMs for text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. By evaluating several state-of-the-art LMMs, including GPT-4o, it reveals the limited capabilities of existing LMMs underscoring the value of our benchmark.

Via

Access Paper or Ask Questions

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Jul 27, 2024

Ruiyi Zhang, Yufan Zhou, Jian Chen, Jiuxiang Gu, Changyou Chen, Tong Sun

Figure 1 for LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Figure 2 for LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Figure 3 for LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Figure 4 for LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Abstract:Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle with comprehending intensive textual contents embedded within the images, primarily due to the limited text recognition and layout understanding ability. To understand the sources of these limitations, we perform an exploratory analysis showing the drawbacks of classical visual encoders on visual text understanding. Hence, we present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder. Our model surpasses existing state-of-the-art models in various text-rich image understanding tasks, showcasing enhanced comprehension of textual content within images. Together, our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is crucial for future successful multimodal systems.

* NeurIPS 2024 Under Review

Via

Access Paper or Ask Questions

Diffusion Models for Multi-Task Generative Modeling

Jul 24, 2024

Changyou Chen, Han Ding, Bunyamin Sisman, Yi Xu, Ouye Xie, Benjamin Z. Yao, Son Dinh Tran, Belinda Zeng

Figure 1 for Diffusion Models for Multi-Task Generative Modeling

Figure 2 for Diffusion Models for Multi-Task Generative Modeling

Figure 3 for Diffusion Models for Multi-Task Generative Modeling

Figure 4 for Diffusion Models for Multi-Task Generative Modeling

Abstract:Diffusion-based generative modeling has been achieving state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to a single-generation modeling. Can we generalize diffusion models with the ability of multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space. We define the forward diffusion process to be driven by an information aggregation from multiple types of task-data, e.g., images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-modal data with a multi-task loss, which is derived from a new multi-modal variational lower bound that generalizes the standard diffusion model. We propose several multimodal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling. Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-modal generative modeling, which we believe is an important research direction worthy of more future explorations.

* Published as a conference paper at ICLR 2024

Via

Access Paper or Ask Questions

Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning

Jul 24, 2024

Jingchen Sun, Rohan Sharma, Vishnu Suresh Lokhande, Changyou Chen

Figure 1 for Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning

Figure 2 for Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning

Figure 3 for Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning

Figure 4 for Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning

Abstract:Prompt Tuning has emerged as a prominent research paradigm for adapting vision-language models to various downstream tasks. However, recent research indicates that prompt tuning methods often lead to overfitting due to limited training samples. In this paper, we propose a Cross-modal Aligned Feature Tuning (Craft) method to address this issue. Cross-modal alignment is conducted by first selecting anchors from the alternative domain and deriving relative representations of the embeddings for the selected anchors. Optimizing for a feature alignment loss over anchor-aligned text and image modalities creates a more unified text-image common space. Overfitting in prompt tuning also deteriorates model performance on out-of-distribution samples. To further improve the prompt model's robustness, we propose minimizing Maximum Mean Discrepancy (MMD) over the anchor-aligned feature spaces to mitigate domain shift. The experiment on four different prompt tuning structures consistently shows the improvement of our method, with increases of up to $6.1\%$ in the Base-to-Novel generalization task, $5.8\%$ in the group robustness task, and $2.7\%$ in the out-of-distribution tasks. The code will be available at https://github.com/Jingchensun/Craft

* 15pages

Via

Access Paper or Ask Questions

TRINS: Towards Multimodal Language Models that Can Read

Jun 10, 2024

Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun

Figure 1 for TRINS: Towards Multimodal Language Models that Can Read

Figure 2 for TRINS: Towards Multimodal Language Models that Can Read

Figure 3 for TRINS: Towards Multimodal Language Models that Can Read

Figure 4 for TRINS: Towards Multimodal Language Models that Can Read

Abstract:Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily due to the limitation of training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of the multimodal large language model. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that the number of words per annotation in TRINS is significantly longer than that of related datasets, providing new challenges. Furthermore, we introduce a simple and effective architecture, called a Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset, as well as other classical benchmarks. Lastly, we conducted a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks, demonstrating its effectiveness.

* CVPR 2024

Via

Access Paper or Ask Questions

LLaVA Finds Free Lunch: Teaching Human Behavior Improves Content Understanding Abilities Of LLMs

May 02, 2024

Somesh Singh, Harini S I, Yaman K Singla, Veeky Baths, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy

Abstract:Communication is defined as ``Who says what to whom with what effect.'' A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Even after carrying signals about the message, the behavior data is often ignored while training large language models. We show that training LLMs on receiver behavior can actually help improve their content-understanding abilities. Specifically, we show that training LLMs to predict the receiver behavior of likes and comments improves the LLM's performance on a wide variety of downstream content understanding tasks. We show this performance increase over 40 video and image understanding tasks over 23 benchmark datasets across both 0-shot and fine-tuning settings, outperforming many supervised baselines. Moreover, since receiver behavior, such as likes and comments, is collected by default on the internet and does not need any human annotations to be useful, the performance improvement we get after training on this data is essentially free-lunch. We release the receiver behavior cleaned comments and likes of 750k images and videos collected from multiple platforms along with our instruction-tuning data.

Via

Access Paper or Ask Questions