Title: MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Nov 26, 2024

Rongchang Xie, Chen Du, Ping Song, Chang Liu

Figures 1-4 for MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Abstract: We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) consider only low-level information, which makes their tokens difficult to align with textual semantic features. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, the performance of these unified models still falls far short of dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces training difficulty and improves the performance of the unified model. The proposed model significantly surpasses the previous state-of-the-art on various vision-language benchmarks and achieves better performance than dedicated understanding models.
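The abstract does not spell out how the semantic constraint is imposed, so the following is only a minimal sketch of the general idea, not the paper's actual SDE objective. It assumes a nearest-neighbor codebook lookup (as in VQ-style tokenizers) and adds a cosine-alignment term that pulls each quantized code toward a semantic feature from some pretrained image-text encoder. The function names, the `sem_weight` parameter, and the use of quantization error as a stand-in for the low-level objective are all hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (the discrete token ID)."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def semantic_vq_loss(patch_feats, semantic_feats, codebook, sem_weight=1.0):
    """Hypothetical combined loss for a semantically constrained tokenizer.

    patch_feats:    low-level encoder features, one vector per image patch.
    semantic_feats: assumed features from a pretrained image-text encoder,
                    already projected to the codebook dimension.
    Returns (token_ids, loss).
    """
    ids = [nearest_code(f, codebook) for f in patch_feats]
    quantized = [codebook[i] for i in ids]
    n = len(patch_feats)
    # Low-level term: how far each feature is from its chosen code.
    quant_err = sum(sq_dist(f, q) for f, q in zip(patch_feats, quantized)) / n
    # Semantic term: 1 - cosine similarity to the semantic target.
    sem_err = sum(
        1.0 - cosine_similarity(q, s) for q, s in zip(quantized, semantic_feats)
    ) / n
    return ids, quant_err + sem_weight * sem_err


# Toy usage with a two-entry codebook and two patches.
codebook = [[1.0, 0.0], [0.0, 1.0]]
patch_feats = [[0.9, 0.1], [0.2, 0.8]]
semantic_feats = [[1.0, 0.0], [0.0, 1.0]]
token_ids, loss = semantic_vq_loss(patch_feats, semantic_feats, codebook)
```

In this toy setup the semantic term vanishes when the selected codes already point in the same direction as the semantic targets, and grows when they do not, which is one simple way a tokenizer could be encouraged to produce tokens that carry language-aligned meaning.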
