Xichen Pan

Transfer between Modalities with MetaQueries

Apr 08, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu (+2 more)

Abstract: Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.

* Project Page: https://xichenpan.com/metaquery

PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop

Mar 12, 2025

Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, Saining Xie

Abstract: Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling object freefall. We show state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small amount of simulated videos is effective in inducing the dropping behavior in the model, and we can further improve results through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development.

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Jun 24, 2024

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan (+4 more)

Abstract: We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.

* Website at https://cambrian-mllm.github.io

Image Sculpting: Precise Object Editing with 3D Geometry Control

Jan 02, 2024

Jiraphon Yenphraphai, Xichen Pan, Sainan Liu, Daniele Panozzo, Saining Xie

Abstract: We present Image Sculpting, a new framework for editing 2D images by incorporating tools from 3D geometry and graphics. This approach differs markedly from existing methods, which are confined to 2D spaces and typically rely on textual instructions, leading to ambiguity and limited control. Image Sculpting converts 2D objects into 3D, enabling direct interaction with their 3D geometry. Post-editing, these objects are re-rendered into 2D, merging into the original image to produce high-fidelity results through a coarse-to-fine enhancement process. The framework supports precise, quantifiable, and physically plausible editing options such as pose editing, rotation, translation, 3D composition, carving, and serial addition. It marks an initial step towards combining the creative freedom of generative models with the precision of graphics pipelines.

* Code and project page: https://image-sculpting.github.io

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Oct 04, 2023

Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei

Abstract: Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I) generation have made significant strides. However, the generation from generalized vision-language inputs, especially involving multiple images, remains under-explored. This paper presents Kosmos-G, a model that leverages the advanced perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates a unique capability of zero-shot multi-entity subject-driven generation. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of "image as a foreign language in image generation."

* Code: https://aka.ms/Kosmos-G
* Project Page: https://xichenpan.github.io/kosmosg

Learning Temporal Distribution and Spatial Correlation for Universal Moving Object Segmentation

Apr 19, 2023

Guanfang Dong, Chenqiu Zhao, Xichen Pan, Anup Basu

Abstract: Universal moving object segmentation aims to provide a general model for videos from all types of natural scenes, as previous approaches are usually effective only for specific or similar scenes. In this paper, we propose a method called Learning Temporal Distribution and Spatial Correlation (LTS) that has the potential to be a general solution for universal moving object segmentation. In the proposed approach, the distribution from temporal pixels is first learned by our Defect Iterative Distribution Learning (DIDL) network for a scene-independent segmentation. Then, the Stochastic Bayesian Refinement (SBR) Network, which learns the spatial correlation, is proposed to improve the binary mask generated by the DIDL network. Benefiting from the scene independence of the temporal distribution and the accuracy improvement resulting from the spatial correlation, the proposed approach performs well for almost all videos from diverse and complex natural scenes with fixed parameters. Comprehensive experiments on standard datasets including LASIESTA, CDNet2014, BMC, SBMI2015 and 128 real-world videos demonstrate the superiority of the proposed approach compared to state-of-the-art methods with or without the use of deep learning networks. To the best of our knowledge, this work has high potential to be a general solution for moving object segmentation in real-world environments.

Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models

Nov 20, 2022

Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, Wenhu Chen

Abstract: Conditioned diffusion models have demonstrated state-of-the-art text-to-image synthesis capacity. Most recent works focus on synthesizing independent images, whereas in real-world applications it is common and necessary to generate a series of coherent images for storytelling. In this work, we focus on story visualization and continuation tasks and propose AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and generated images. Moreover, AR-LDM can generalize to new characters through adaptation. To the best of our knowledge, this is the first work to successfully leverage diffusion models for coherent visual story synthesis. Quantitative results show that AR-LDM achieves SoTA FID scores on PororoSV, FlintstonesSV, and the newly introduced challenging dataset VIST containing natural images. Large-scale human evaluations show that AR-LDM has superior performance in terms of quality, relevance, and consistency.

* Technical Report
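
The auto-regressive conditioning described in the abstract can be pictured as a simple loop: each frame is generated from the current caption plus all earlier captions and generated images. The sketch below is only an illustration of that control flow, not the AR-LDM implementation. The names `generate_story`, `sample_frame`, and `toy_sampler` are hypothetical, and the sampler is a string-producing stand-in for a real diffusion model.

```python
from typing import Callable, List, Tuple

# Placeholder image type: a real system would use tensors or latents.
Image = str
History = Tuple[Tuple[str, Image], ...]


def generate_story(
    captions: List[str],
    sample_frame: Callable[[str, History], Image],
) -> List[Image]:
    """Generate one image per caption, conditioning each step on the history.

    `sample_frame` is a hypothetical stand-in for a conditional generator.
    It receives the current caption and every earlier (caption, image) pair.
    """
    history: List[Tuple[str, Image]] = []
    images: List[Image] = []
    for caption in captions:
        # Condition on the current caption and on all previous captions/images.
        image = sample_frame(caption, tuple(history))
        images.append(image)
        # The generated image becomes part of the context for later frames.
        history.append((caption, image))
    return images


def toy_sampler(caption: str, history: History) -> Image:
    """Toy generator that only records how much history it was given."""
    return f"frame[{caption}|history={len(history)}]"


story = generate_story(
    ["a cat wakes up", "the cat eats", "the cat sleeps"],
    toy_sampler,
)
for frame in story:
    print(frame)
```

In this toy run the first frame sees an empty history and the third frame sees two earlier (caption, image) pairs, which is the dependency structure that keeps a generated story coherent across frames.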

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Mar 26, 2022

Xichen Pan, Peiyu Chen, Yichen Gong, Helong Zhou, Xinbing Wang, Zhouhan Lin

Abstract: Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to use unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we successfully leverage unimodal self-supervised learning to promote multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets, then we integrate components of both front-ends into a larger multimodal framework which learns to transcribe parallel audio-visual data into characters through a combination of CTC and seq2seq decoding. We show that both components inherited from unimodal self-supervised learning cooperate well, so that the multimodal framework yields competitive results through fine-tuning. Our model is experimentally validated on both word-level and sentence-level tasks. Notably, even without an external language model, our proposed model improves on the state-of-the-art performance on the widely accepted Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%.

* ACL 2022 Main Conference
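
A "relative improvement of 30%" is easy to confuse with an absolute gain of 30 points. The short sketch below shows the difference, assuming the metric is an error rate where lower is better. The baseline and new values are made-up illustrative numbers, not results from the paper, and `relative_improvement` is a hypothetical helper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the fractional reduction in error relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates in percent, chosen only to illustrate the arithmetic.
baseline = 10.0
ours = 7.0

rel = relative_improvement(baseline, ours)
absolute = baseline - ours

print(f"relative improvement: {rel:.0%}")
print(f"absolute improvement: {absolute:.1f} points")
```

With these illustrative numbers, a 30% relative improvement corresponds to an absolute reduction of only 3 points, because the reduction is measured as a fraction of the baseline error.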
