Title: Can MLLMs Perform Text-to-Image In-Context Learning?

Feb 02, 2024

Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee

Abstract: The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing studies have concentrated primarily on image-to-text ICL. However, Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Using our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties that MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation. To overcome these challenges, we explore strategies such as fine-tuning and Chain-of-Thought prompting, demonstrating notable improvements. Our code and dataset are available at https://github.com/UW-Madison-Lee-Lab/CoBSAT.
