Title: Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations

Oct 02, 2024

Minoh Jeong, Min Namgung, Zae Myung Kim, Dongyeop Kang, Yao-Yi Chiang, Alfred Hero

Abstract: Multimodal learning plays a crucial role in enabling machine learning models to fuse and utilize diverse data sources, such as text, images, and audio, to support a variety of downstream tasks. A unified representation across modalities is particularly important for improving efficiency and performance. Recent binding methods, such as ImageBind (Girdhar et al., 2023), typically use a fixed anchor modality and align all other modalities in the anchor modality's embedding space. In this paper, we mathematically analyze fixed-anchor binding methods and uncover notable limitations: (1) over-reliance on the choice of the anchor modality, (2) failure to capture intra-modal information, and (3) failure to account for inter-modal correlation among non-anchored modalities. To address these limitations, we propose CentroBind, a simple yet powerful approach that eliminates the need for a fixed anchor; instead, it employs dynamically adjustable centroid-based anchors generated from all available modalities, resulting in a balanced and rich representation space. We theoretically demonstrate that our method captures three crucial properties of multimodal learning: intra-modal learning, inter-modal learning, and multimodal alignment, while also constructing a robust unified representation across all modalities. Our experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed method, showing that dynamic-anchor methods outperform all fixed-anchor binding methods because they capture more nuanced multimodal interactions.
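The contrast between a fixed anchor and a centroid-based anchor can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so an InfoNCE-style contrastive loss is assumed here, the centroid is assumed to be the plain mean of the per-modality embeddings, and every function name (`centroid_anchor`, `info_nce`, `centroid_binding_loss`, `fixed_anchor_binding_loss`) is hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def centroid_anchor(embeddings):
    """Assumed anchor: per-sample mean over all modality embeddings.

    embeddings: list of (batch, dim) arrays, one per modality,
    where row i of every array describes the same sample.
    """
    return np.mean(np.stack(embeddings, axis=0), axis=0)


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE-style loss (an assumption; the abstract names no loss).

    Row i of `queries` is treated as the positive match for row i of `keys`;
    all other rows in the batch act as negatives.
    """
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def centroid_binding_loss(embeddings, temperature=0.1):
    """Pull every modality toward the centroid built from all modalities."""
    anchor = centroid_anchor(embeddings)
    losses = [info_nce(z, anchor, temperature) for z in embeddings]
    return sum(losses) / len(losses)


def fixed_anchor_binding_loss(embeddings, anchor_index=0, temperature=0.1):
    """Fixed-anchor baseline: pull every other modality toward one modality."""
    anchor = embeddings[anchor_index]
    others = [z for i, z in enumerate(embeddings) if i != anchor_index]
    losses = [info_nce(z, anchor, temperature) for z in others]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three hypothetical modalities, 4 samples each, 8-dimensional embeddings.
    embs = [rng.normal(size=(4, 8)) for _ in range(3)]
    print("centroid binding loss:", centroid_binding_loss(embs))
    print("fixed-anchor loss:    ", fixed_anchor_binding_loss(embs))
```

In this sketch the fixed-anchor loss never compares the non-anchor modalities with each other, whereas the centroid depends on every modality, so each embedding is compared against a target that mixes information from all of them. In a real training loop the embeddings would come from trainable encoders and the anchor would be recomputed at each step.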
