Yueting Zhuang

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Mar 27, 2025

SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models

Mar 10, 2025

Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems

Mar 09, 2025

InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models

Mar 09, 2025

Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts

Mar 07, 2025

MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task

Feb 17, 2025

STRIDE: Automating Reward Design, Deep Reinforcement Learning Training and Feedback Optimization in Humanoid Robotics Locomotion

Feb 10, 2025

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Jan 08, 2025

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Jan 03, 2025

MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation

Dec 28, 2024