Title: LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

Mar 25, 2025

Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, Kai Chen

Abstract: Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.

* 12 pages, 7 figures
