Zhu Liang

Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis

Dec 18, 2025

Zhi Helu, Huang Jingjing, Xu Wang, Xu Yangbin, Zhang Wanyue, Jiang Baoyang, Deng Shirui, Zhu Liang, Li Fangfang, Zhao Tiejun(+2 more)

Abstract: Embodied intelligence, a grand challenge in artificial intelligence, is fundamentally constrained by the limited spatial understanding and reasoning capabilities of current models. Prevailing efforts to address this through enhancing Vision-Language Models (VLMs) are trapped in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and, critically, computationally imprecise. We introduce SPRITE, a novel framework that overcomes this dilemma by leveraging simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. The core innovation of SPRITE is to reframe ground-truth generation as a code-generation task. We utilize LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This ensures our ground truth is both computationally precise and verifiable, while the generative power of LLMs provides vast linguistic diversity. Leveraging this pipeline, we have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks and outperforms other open-source datasets of equivalent size. Furthermore, a scalability analysis confirms our hypothesis that overcoming the low-diversity nature of traditional template methods is essential for building robust, generalizable spatial intelligence. We will make the SPRITE framework code and the full 300k+ dataset publicly available to facilitate future research in spatial intelligence.
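
The idea of treating ground-truth generation as code generation can be illustrated with a toy sketch. This is not the SPRITE implementation: the scene dictionary, the helper names (`distance`, `closer_object`, `answer_program`) and the example question are all hypothetical, chosen only to show how an answer might be computed from scene metadata by a small program instead of being annotated by hand.

```python
import math

# Hypothetical scene metadata, standing in for what a simulator might export.
# Positions are (x, y, z) coordinates in metres (an assumption for this sketch).
SCENE = {
    "chair": {"position": (1.0, 0.0, 2.0)},
    "table": {"position": (4.0, 0.0, 6.0)},
    "lamp": {"position": (1.5, 0.0, 2.5)},
}


def distance(scene, a, b):
    """Euclidean distance between the positions of two named objects."""
    return math.dist(scene[a]["position"], scene[b]["position"])


def closer_object(scene, anchor, candidates):
    """Return the candidate whose position is nearest to the anchor object."""
    return min(candidates, key=lambda name: distance(scene, anchor, name))


def answer_program(scene):
    """A program of the kind an LLM might emit for the question
    'Which is closer to the chair: the table or the lamp?' (hypothetical)."""
    return closer_object(scene, "chair", ["table", "lamp"])


# Executing the program against the metadata yields a checkable answer.
answer = answer_program(SCENE)
chair_table = distance(SCENE, "chair", "table")
print(answer, chair_table)
```

Because the answer is computed from the metadata, it can be re-derived and checked at any time, which is the property the abstract describes as "computationally precise and verifiable".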

Decoding Structure-Spectrum Relationships with Physically Organized Latent Spaces

Jan 11, 2023

Zhu Liang, Matthew R. Carbone, Wei Chen, Fanchen Meng, Eli Stavitski, Deyu Lu, Mark S. Hybertsen, Xiaohui Qu

Abstract: A new semi-supervised machine learning method for the discovery of structure-spectrum relationships is developed and demonstrated using the specific example of interpreting X-ray absorption near-edge structure (XANES) spectra. This method constructs a one-to-one mapping between individual structure descriptors and spectral trends. Specifically, an adversarial autoencoder is augmented with a novel rank constraint (RankAAE). The RankAAE methodology produces a continuous and interpretable latent space, where each dimension can track an individual structure descriptor. As a part of this process, the model provides a robust and quantitative measure of the structure-spectrum relationship by decoupling intertwined spectral contributions from multiple structural characteristics. This makes it ideal for spectral interpretation and the discovery of new descriptors. The capability of this procedure is showcased by considering five local structure descriptors and a database of over fifty thousand simulated XANES spectra across eight first-row transition metal oxide families. The resulting structure-spectrum relationships not only reproduce known trends in the literature, but also reveal unintuitive ones that are visually indiscernible in large data sets. The results suggest that the RankAAE methodology has great potential to assist researchers to interpret complex scientific data, test physical hypotheses, and reveal new patterns that extend scientific insight.
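
The abstract does not give the form of the rank constraint. As an illustration of what it could mean for a latent dimension to "track" a structure descriptor, the sketch below measures rank agreement between the two with the Kendall rank correlation (tau-a). Choosing Kendall's tau is an assumption made for this example, and the numbers are made up; this is not the RankAAE loss itself.

```python
def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a) between two equal-length sequences.

    Counts pairs ordered the same way in both sequences (concordant) and
    pairs ordered oppositely (discordant); tied pairs count as neither.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up values: one latent coordinate per sample, and a hypothetical
# structure descriptor (for example a coordination number) per sample.
latent = [-1.2, -0.3, 0.1, 0.8, 1.5]
descriptor_ordered = [2.0, 3.0, 4.0, 5.0, 6.0]   # same ordering as latent
descriptor_swapped = [2.0, 4.0, 3.0, 5.0, 6.0]   # one pair out of order

tau_ordered = kendall_tau(latent, descriptor_ordered)
tau_swapped = kendall_tau(latent, descriptor_swapped)
print(tau_ordered, tau_swapped)
```

A value near 1 means that sorting samples by the latent coordinate sorts them by the descriptor as well, which is one way to read "each dimension can track an individual structure descriptor".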
