Yue Fan

LongViTU: Instruction Tuning for Long-Form Video Understanding

Jan 09, 2025
Via arXiv

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

Dec 31, 2024
Via arXiv

Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

Dec 20, 2024
Via arXiv

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Oct 30, 2024
Via arXiv

Toward a Diffusion-Based Generalist for Dense Vision Tasks

Jun 29, 2024
Via arXiv

Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

Jun 27, 2024
Via arXiv

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Jun 12, 2024
Via arXiv

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

Mar 22, 2024
Via arXiv

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Mar 18, 2024
Via arXiv

Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey

Feb 08, 2024
Via arXiv