Title: X-VILA: Cross-Modality Alignment for Large Language Model

May 29, 2024

Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov (+1 more)

Abstract: We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, which exhibits proficiency in any-to-any modality conversation and surpasses previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.
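The input-side alignment mentioned in the abstract (modality-specific encoders aligned with LLM inputs) can be illustrated with a minimal sketch. The abstract does not say how the alignment is implemented, so everything here is an assumption made for illustration: a plain linear projection stands in for the alignment layer, and the names `make_projector`, `project`, and `build_llm_input`, as well as the toy dimensions, are hypothetical. The visual embedding highway and the diffusion decoders are not sketched, because the abstract gives no detail on them.

```python
import random

def make_projector(in_dim, out_dim, seed=0):
    """Hypothetical alignment layer: a random (out_dim x in_dim) weight matrix.
    In a real system this would be learned; here it is only a placeholder."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(features, weight):
    """Map each encoder feature vector (length in_dim) to the LLM embedding
    width (out_dim) with a matrix-vector product."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight]
            for vec in features]

def build_llm_input(text_embeds, modality_embeds):
    """Assumed layout: projected modality tokens are placed before the text
    tokens to form one input sequence for the LLM."""
    return modality_embeds + text_embeds

# Toy example: 3 image tokens of width 4, an LLM embedding width of 6,
# and 2 text tokens that already live in the LLM embedding space.
weight = make_projector(in_dim=4, out_dim=6)
image_features = [[1.0, 0.0, 0.0, 0.0] for _ in range(3)]
text_embeds = [[0.0] * 6 for _ in range(2)]

projected = project(image_features, weight)
sequence = build_llm_input(text_embeds, projected)
print(len(sequence), len(sequence[0]))
```

The point of the sketch is only that, once projected, tokens from any modality share the LLM's embedding width and can be concatenated into one sequence.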

* Technical Report
