Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yiqing Liang

MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning

May 30, 2025

Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for post-training large language models (LLMs), achieving state-of-the-art performance on tasks with structured, verifiable answers. Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but is complicated by the broader, heterogeneous nature of vision-language tasks that demand nuanced visual, logical, and spatial capabilities. As such, training MLLMs using RLVR on multiple datasets could be beneficial but creates challenges with conflicting objectives from interaction among diverse datasets, highlighting the need for optimal dataset mixture strategies to improve generalization and reasoning. We introduce a systematic post-training framework for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation and benchmark implementation. Specifically, (1) We developed a multimodal RLVR framework for multi-dataset post-training by curating a dataset that contains different verifiable vision-language problems and enabling multi-domain online RL learning with different verifiable rewards; (2) We proposed a data mixture strategy that learns to predict the RL fine-tuning outcome from the data mixture distribution, and consequently optimizes the best mixture. Comprehensive experiments showcase that multi-domain RLVR training, when combined with mixture prediction strategies, can significantly boost MLLM general reasoning capacities. Our best mixture improves the post-trained model's accuracy on out-of-distribution benchmarks by an average of 5.24% compared to the same model post-trained with uniform data mixture, and by a total of 20.74% compared to the pre-finetuning baseline.

* Project Webpage: https://modomodo-rl.github.io/

Via

Access Paper or Ask Questions

Zero-Shot Monocular Scene Flow Estimation in the Wild

Jan 17, 2025

Yiqing Liang, Abhishek Badki, Hang Su, James Tompkin, Orazio Gallo

Abstract:Large models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such general models exist for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We identify three key challenges and propose solutions for each.First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and adopt a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on large-scale models in terms of 3D end-point error, and shows zero-shot generalization to the casually captured videos from DAVIS and the robotic manipulation scenes from RoboTAP. Overall, our approach makes scene flow prediction more practical in-the-wild.

* Project Website: https://research.nvidia.com/labs/zero_msf

Via

Access Paper or Ask Questions

Monocular Dynamic Gaussian Splatting is Fast and Brittle but Smooth Motion Helps

Dec 05, 2024

Yiqing Liang, Mikhail Okunev, Mikaela Angelina Uy, Runfeng Li, Leonidas Guibas, James Tompkin, Adam W. Harley

Figure 1 for Monocular Dynamic Gaussian Splatting is Fast and Brittle but Smooth Motion Helps

Figure 2 for Monocular Dynamic Gaussian Splatting is Fast and Brittle but Smooth Motion Helps

Figure 3 for Monocular Dynamic Gaussian Splatting is Fast and Brittle but Smooth Motion Helps

Figure 4 for Monocular Dynamic Gaussian Splatting is Fast and Brittle but Smooth Motion Helps

Abstract:Gaussian splatting methods are emerging as a popular approach for converting multi-view image data into scene representations that allow view synthesis. In particular, there is interest in enabling view synthesis for dynamic scenes using only monocular input data -- an ill-posed and challenging problem. The fast pace of work in this area has produced multiple simultaneous papers that claim to work best, which cannot all be true. In this work, we organize, benchmark, and analyze many Gaussian-splatting-based methods, providing apples-to-apples comparisons that prior works have lacked. We use multiple existing datasets and a new instructive synthetic dataset designed to isolate factors that affect reconstruction quality. We systematically categorize Gaussian splatting methods into specific motion representation types and quantify how their differences impact performance. Empirically, we find that their rank order is well-defined in synthetic data, but the complexity of real-world data currently overwhelms the differences. Furthermore, the fast rendering speed of all Gaussian-based methods comes at the cost of brittleness in optimization. We summarize our experiments into a list of findings that can help to further progress in this lively problem setting. Project Webpage: https://lynl7130.github.io/MonoDyGauBench.github.io/

* 37 pages, 39 figures, 9 tables

Via

Access Paper or Ask Questions

GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis

Dec 18, 2023

Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, Lei Xiao

Figure 1 for GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis

Figure 2 for GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis

Figure 3 for GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis

Figure 4 for GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis

Abstract:We propose a method for dynamic scene reconstruction using deformable 3D Gaussians that is tailored for monocular video. Building upon the efficiency of Gaussian splatting, our approach extends the representation to accommodate dynamic elements via a deformable set of Gaussians residing in a canonical space, and a time-dependent deformation field defined by a multi-layer perceptron (MLP). Moreover, under the assumption that most natural scenes have large regions that remain static, we allow the MLP to focus its representational power by additionally including a static Gaussian point cloud. The concatenated dynamic and static point clouds form the input for the Gaussian Splatting rasterizer, enabling real-time rendering. The differentiable pipeline is optimized end-to-end with a self-supervised rendering loss. Our method achieves results that are comparable to state-of-the-art dynamic neural radiance field methods while allowing much faster optimization and rendering. Project website: https://lynl7130.github.io/gaufre/index.html

* 10 pages, 8 figures, 4 tables

Via

Access Paper or Ask Questions

Semantic Attention Flow Fields for Dynamic Scene Decomposition

Mar 02, 2023

Yiqing Liang, Eliot Laidlaw, Alexander Meyerowitz, Srinath Sridhar, James Tompkin

Figure 1 for Semantic Attention Flow Fields for Dynamic Scene Decomposition

Figure 2 for Semantic Attention Flow Fields for Dynamic Scene Decomposition

Figure 3 for Semantic Attention Flow Fields for Dynamic Scene Decomposition

Figure 4 for Semantic Attention Flow Fields for Dynamic Scene Decomposition

Abstract:We present SAFF: a dynamic neural volume reconstruction of a casual monocular video that consists of time-varying color, density, scene flow, semantics, and attention information. The semantics and attention let us identify salient foreground objects separately from the background in arbitrary spacetime views. We add two network heads to represent the semantic and attention information. For optimization, we design semantic attention pyramids from DINO-ViT outputs that trade detail with whole-image context. After optimization, we perform a saliency-aware clustering to decompose the scene. For evaluation on real-world dynamic scene decomposition across spacetime, we annotate object masks in the NVIDIA Dynamic Scene Dataset. We demonstrate that SAFF can decompose dynamic scenes without affecting RGB or depth reconstruction quality, that volume-integrated SAFF outperforms 2D baselines, and that SAFF improves foreground/background segmentation over recent static/dynamic split methods. Project Webpage: https://visual.cs.brown.edu/saff

Via

Access Paper or Ask Questions

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Dec 16, 2021

Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

Figure 1 for SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Figure 2 for SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Figure 3 for SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Figure 4 for SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Abstract:Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted in the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with the state-of-the-art methods and prove the efficacy of each proposed component.

* AAAI 2022
* AAAI 2022

Via

Access Paper or Ask Questions

SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation

Dec 08, 2020

Yiqing Liang, Boyuan Chen, Shuran Song

Figure 1 for SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation

Figure 2 for SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation

Figure 3 for SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation

Figure 4 for SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation

Abstract:This paper focuses on visual semantic navigation, the task of producing actions for an active agent to navigate to a specified target object category in an unknown environment. To complete this task, the algorithm should simultaneously locate and navigate to an instance of the category. In comparison to the traditional point goal navigation, this task requires the agent to have a stronger contextual prior of indoor environments. We introduce SSCNav, an algorithm that explicitly models scene priors using a confidence-aware semantic scene completion module to complete the scene and guide the agent's navigation planning. Given a partial observation of the environment, SSCNav first infers a complete scene representation with semantic labels for the unobserved scene together with a confidence map associated with its own prediction. Then, a policy network infers the action from the scene completion result and confidence map. Our experiments demonstrate that the proposed scene completion module improves the efficiency of the downstream navigation policies. https://youtu.be/tfBbdGS72zg

Via

Access Paper or Ask Questions