
Zhizheng Zhang

Southeast University, China

WeGen: A Unified Model for Interactive Multimodal Generation as We Chat

Mar 03, 2025

FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real

Feb 25, 2025

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Feb 18, 2025

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Dec 09, 2024

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Dec 05, 2024

A General Theory for Compositional Generalization

May 20, 2024

Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis

May 13, 2024

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

Mar 19, 2024

VisualCritic: Making LMMs Perceive Visual Quality Like Humans

Mar 19, 2024

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Mar 01, 2024