Rongtao Xu

NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation

Jan 26, 2026
Via arXiv

MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Dec 26, 2025
Via arXiv

GLaD: Geometric Latent Distillation for Vision-Language-Action Models

Dec 10, 2025
Via arXiv

Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction

Nov 13, 2025
Via arXiv

CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion

Oct 14, 2025
Via arXiv

ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation

Sep 16, 2025
Via arXiv

$\mathcal{P}^3$: Toward Versatile Embodied Agents

Aug 09, 2025
Via arXiv

3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

Jul 16, 2025
Via arXiv

PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly

Jun 10, 2025
Via arXiv

SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection

May 29, 2025
Via arXiv