Joya Chen

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Apr 22, 2025
Via arXiv

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Sep 29, 2024
Via arXiv

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Aug 29, 2024
Via arXiv

Learning Video Context as Interleaved Multimodal Sequences

Jul 31, 2024
Via arXiv

VideoLLM-online: Online Video Large Language Model for Streaming Video

Jun 17, 2024
Via arXiv

From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition

Jun 12, 2024
Via arXiv

Bootstrapping SparseFormers from Vision Foundation Models

Dec 04, 2023
Via arXiv

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Nov 30, 2023
Via arXiv

UniVTG: Towards Unified Video-Language Temporal Grounding

Aug 18, 2023
Via arXiv

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

Jun 28, 2023
Via arXiv