Xitong Yang

Progress-Aware Video Frame Captioning

Dec 03, 2024
Via arXiv

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

Sep 30, 2024
Via arXiv

GenRec: Unifying Video Generation and Recognition with Diffusion Models

Aug 27, 2024
Via arXiv

Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning

Aug 07, 2024
Via arXiv

Video ReCap: Recursive Captioning of Hour-Long Videos

Feb 28, 2024
Via arXiv

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Nov 30, 2023
Via arXiv

Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

Oct 08, 2023
Via arXiv

Towards Scalable Neural Representation for Diverse Videos

Mar 24, 2023
Via arXiv

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

Feb 16, 2023
Via arXiv

Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

Feb 01, 2023
Via arXiv