Yuxiao Yang

Multimodal LLMs for Visualization Reconstruction and Understanding

Jun 26, 2025

Can Liu, Chunlin Da, Xiaoxiao Long, Yuxiao Yang, Yu Zhang, Yong Wang

Abstract: Visualizations are crucial for data communication, yet understanding them requires comprehension of both visual elements and their underlying data relationships. Current multimodal large models, while effective in natural image understanding, struggle with visualizations because they cannot decode the data-to-visual mapping rules or extract structured information. To address these challenges, we present a novel dataset and train multimodal visualization LLMs specifically designed for visualization understanding. Our approach combines chart images with their corresponding vectorized representations, encoding schemes, and data features. The proposed vector format enables compact and accurate reconstruction of visualization content. Experimental results demonstrate significant improvements in both data extraction accuracy and chart reconstruction quality.

NOVA3D: Normal Aligned Video Diffusion Model for Single Image to 3D Generation

Jun 09, 2025

Yuxiao Yang, Peihao Li, Yuhong Zhang, Junzhe Lu, Xianglong He, Minghan Qin, Weitao Wang, Haoqian Wang

Abstract: 3D AI-generated content (AIGC) has made it increasingly easy for anyone to become a 3D content creator. While recent methods leverage Score Distillation Sampling to distill 3D objects from pretrained image diffusion models, they often suffer from inadequate 3D priors, leading to insufficient multi-view consistency. In this work, we introduce NOVA3D, an innovative single-image-to-3D generation framework. Our key insight lies in leveraging strong 3D priors from a pretrained video diffusion model and integrating geometric information during multi-view video fine-tuning. To facilitate information exchange between color and geometric domains, we propose the Geometry-Temporal Alignment (GTA) attention mechanism, thereby improving generalization and multi-view consistency. Moreover, we introduce the de-conflict geometry fusion algorithm, which improves texture fidelity by addressing multi-view inaccuracies and resolving discrepancies in pose alignment. Extensive experiments validate the superiority of NOVA3D over existing baselines.

* 8 pages, 7 figures, accepted by ICME 2025

Hindsight Planner: A Closed-Loop Few-Shot Planner for Embodied Instruction Following

Dec 27, 2024

Yuxiao Yang, Shenao Zhang, Zhihan Liu, Huaxiu Yao, Zhaoran Wang

Abstract: This work focuses on building a task planner for Embodied Instruction Following (EIF) using Large Language Models (LLMs). Previous works typically train a planner to imitate expert trajectories, treating this as a supervised task. While these methods achieve competitive performance, they often lack sufficient robustness. When a suboptimal action is taken, the planner may encounter an out-of-distribution state, which can lead to task failure. In contrast, we frame the task as a Partially Observable Markov Decision Process (POMDP) and aim to develop a robust planner under a few-shot assumption. Thus, we propose a closed-loop planner with an adaptation module and a novel hindsight method, aiming to use as much information as possible to assist the planner. Our experiments on the ALFRED dataset indicate that our planner achieves competitive performance under a few-shot assumption. For the first time, our few-shot agent's performance approaches and even surpasses that of the full-shot supervised agent.

MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences

Dec 09, 2024

Weitao Wang, Haoran Xu, Yuxiao Yang, Zhifang Liu, Jun Meng, Haoqian Wang

Abstract: Recent years have witnessed remarkable progress in 3D content generation. However, corresponding evaluation methods struggle to keep pace. Automatic approaches have proven challenging to align with human preferences, and the mixed comparison of text- and image-driven methods often leads to unfair evaluations. In this paper, we present a comprehensive framework to better align and evaluate multi-view diffusion models with human preferences. We first collect and filter a standardized image prompt set from DALL·E and Objaverse, which we then use to generate multi-view assets with several multi-view diffusion models. Through a systematic ranking pipeline on these assets, we obtain a human annotation dataset with 16k expert pairwise comparisons and train a reward model, coined MVReward, to effectively encode human preferences. With MVReward, image-driven 3D methods can be evaluated against each other in a fairer and more transparent manner. Building on this, we further propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy. Extensive experiments demonstrate that MVReward can serve as a reliable metric and that MVP consistently enhances the alignment of multi-view diffusion models with human preferences.
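
The abstract does not state which objective is used to train the reward model on pairwise comparisons. As a rough illustration only, the sketch below uses the Bradley-Terry pairwise loss that is common in preference learning; this choice is an assumption, not MVReward's confirmed training loss, and the function name and scalar scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred asset wins a comparison.

    Assumes a Bradley-Terry model (not confirmed by the abstract):
    P(preferred beats rejected) = sigmoid(score_preferred - score_rejected).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for two multi-view assets in one annotated comparison.
loss_agrees = pairwise_preference_loss(2.0, 0.0)     # model ranks the pair like the annotator
loss_disagrees = pairwise_preference_loss(0.0, 2.0)  # model ranks the pair the opposite way
```

Under this assumed objective, the loss is small when the model scores the human-preferred asset higher and grows roughly linearly with the margin when it scores it lower, so minimizing it pushes the reward model toward the annotators' rankings.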

Can the Inference Logic of Large Language Models be Disentangled into Symbolic Concepts?

Apr 03, 2023

Wen Shen, Lei Cheng, Yuxiao Yang, Mingjie Li, Quanshi Zhang

Abstract: In this paper, we explain the inference logic of large language models (LLMs) as a set of symbolic concepts. Many recent studies have discovered that traditional DNNs usually encode sparse symbolic concepts. However, because an LLM has many more parameters than traditional DNNs, whether the LLM also encodes sparse symbolic concepts is still an open problem. Therefore, in this paper, we propose to disentangle the inference score of LLMs for dialogue tasks into a small number of symbolic concepts. We verify that those sparse concepts can accurately estimate the inference scores of the LLM on all arbitrarily masked states of the input sentence. We also evaluate the transferability of concepts encoded by an LLM and verify that symbolic concepts usually exhibit high transferability across similar input sentences. More crucially, those symbolic concepts can be used to explain the exact reasons for the LLM's prediction errors.
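
The abstract does not give the decomposition formula. One standard way to formalize "a score on every masked state explained by a few concepts" is the Harsanyi interaction, I(S) = Σ_{T⊆S} (-1)^{|S|-|T|} v(T), under which v(S) = Σ_{T⊆S} I(T) holds exactly. The sketch below assumes that formulation and uses a hand-written toy scoring function `toy_score` in place of a real LLM; all names here are hypothetical.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset, including the empty set."""
    items = tuple(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def interactions(tokens, v):
    """Harsanyi interaction I(S) = sum over T subset of S of (-1)^(|S|-|T|) * v(T)."""
    return {
        S: sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets(S))
        for S in subsets(tokens)
    }


def reconstruct(S, effects):
    """Score of masked state S rebuilt from the interactions of all T contained in S."""
    return sum(effect for T, effect in effects.items() if T <= S)


def toy_score(kept):
    """Toy stand-in for a model's inference score when only `kept` tokens are unmasked."""
    score = 0.0
    if "good" in kept:
        score += 1.0   # "good" alone contributes positively
    if {"not", "good"} <= kept:
        score -= 2.0   # "not" and "good" together flip the sentiment
    return score


tokens = ("not", "good", "movie")
effects = interactions(tokens, toy_score)
salient = {S: e for S, e in effects.items() if abs(e) > 1e-9}
```

In this toy case only two of the eight possible interactions are non-zero, yet `reconstruct` recovers `toy_score` on every masked state, which is the kind of sparsity and faithfulness the abstract describes.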
