Xihan Wei

ViSpeak: Visual Instruction Feedback in Streaming Videos

Mar 17, 2025

A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection

Mar 13, 2025

R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning

Mar 07, 2025

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Jan 31, 2025

HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding

Jan 25, 2025

Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis

Jan 16, 2025

Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness

Jan 14, 2025

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

Jan 09, 2025

Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models

Oct 25, 2024

DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation

Apr 09, 2024