Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yin

Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation

Mar 07, 2025

Qingxuan Jia, Guoqin Tang, Zeyuan Huang, Zixuan Hao, Ning Ji, Shihang, Yin, Gang Chen

Figure 1 for Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation

Figure 2 for Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation

Figure 3 for Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation

Figure 4 for Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation

Abstract:Vision-Language Models (VLMs) demonstrate remarkable potential in robotic manipulation, yet challenges persist in executing complex fine manipulation tasks with high speed and precision. While excelling at high-level planning, existing VLM methods struggle to guide robots through precise sequences of fine motor actions. To address this limitation, we introduce a progressive VLM planning algorithm that empowers robots to perform fast, precise, and error-correctable fine manipulation. Our method decomposes complex tasks into sub-actions and maintains three key data structures: task memory structure, 2D topology graphs, and 3D spatial networks, achieving high-precision spatial-semantic fusion. These three components collectively accumulate and store critical information throughout task execution, providing rich context for our task-oriented VLM interaction mechanism. This enables VLMs to dynamically adjust guidance based on real-time feedback, generating precise action plans and facilitating step-wise error correction. Experimental validation on complex assembly tasks demonstrates that our algorithm effectively guides robots to rapidly and precisely accomplish fine manipulation in challenging scenarios, significantly advancing robot intelligence for precision tasks.

Via

Access Paper or Ask Questions

RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models

Dec 10, 2024

Greg Heinrich, Mike Ranzinger, Hongxu, Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, Pavlo Molchanov

Abstract:Agglomerative models have recently emerged as a powerful approach to training vision foundation models, leveraging multi-teacher distillation from existing models such as CLIP, DINO, and SAM. This strategy enables the efficient creation of robust models, combining the strengths of individual teachers while significantly reducing computational and resource demands. In this paper, we thoroughly analyze state-of-the-art agglomerative models, identifying critical challenges including resolution mode shifts, teacher imbalance, idiosyncratic teacher artifacts, and an excessive number of output tokens. To address these issues, we propose several novel solutions: multi-resolution training, mosaic augmentation, and improved balancing of teacher loss functions. Specifically, in the context of Vision Language Models, we introduce a token compression technique to maintain high-resolution information within a fixed token count. We release our top-performing models, available in multiple scales (-B, -L, -H, and -g), alongside inference code and pretrained weights.

Via

Access Paper or Ask Questions