Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Justin Sun

MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

Mar 05, 2024

Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour

Figure 1 for MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

Figure 2 for MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

Figure 3 for MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

Figure 4 for MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

Abstract:Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data, which is needed in large quantities for LLMs. Previous approaches augment textual dialogues with retrieved images, posing privacy, diversity, and quality constraints. In this work, we introduce \textbf{M}ultimodal \textbf{A}ugmented \textbf{G}enerative \textbf{I}mages \textbf{D}ialogues (MAGID), a framework to augment text-only dialogues with diverse and high-quality images. Subsequently, a diffusion model is applied to craft corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety), that work in tandem to generate high-quality and multi-modal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small.

Via

Access Paper or Ask Questions

Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

Jan 10, 2024

Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, Yi Zhang

Figure 1 for Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

Figure 2 for Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

Figure 3 for Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

Figure 4 for Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

Abstract:Large language models (LLMs) are powerful dialogue agents, but specializing them towards fulfilling a specific function can be challenging. Instructing tuning, i.e. tuning models on instruction and sample responses generated by humans (Ouyang et al., 2022), has proven as an effective method to do so, yet requires a number of data samples that a) might not be available or b) costly to generate. Furthermore, this cost increases when the goal is to make the LLM follow a specific workflow within a dialogue instead of single instructions. Inspired by the self-play technique in reinforcement learning and the use of LLMs to simulate human agents, we propose a more effective method for data collection through LLMs engaging in a conversation in various roles. This approach generates a training data via "self-talk" of LLMs that can be refined and utilized for supervised fine-tuning. We introduce an automated way to measure the (partial) success of a dialogue. This metric is used to filter the generated conversational data that is fed back in LLM for training. Based on our automated and human evaluations of conversation quality, we demonstrate that such self-talk data improves results. In addition, we examine the various characteristics that showcase the quality of generated dialogues and how they can be connected to their potential utility as training data.

Via

Access Paper or Ask Questions