Title: FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

May 10, 2024

Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao(+1 more)

Abstract: Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and the problem of catastrophic forgetting make it challenging to further enhance pre-trained unified spaces. In this work, we propose FreeBind, an idea that treats multimodal representation spaces as basic units and freely augments a pre-trained unified space by integrating knowledge from extra expert spaces via "space bonds". Specifically, we introduce two kinds of basic space bonds: 1) the Space Displacement Bond and 2) the Space Combination Bond. Based on these basic bonds, we design Complex Sequential & Parallel Bonds to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL_IB, and InternVL_IB++. The resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. Moreover, via customized inference, they even surpass the advanced audio-text and image-text expert spaces.

* Accepted by ICML 2024. The code and checkpoints will be released at https://github.com/zehanwang01/FreeBind
