
Xihan Wei

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Jan 31, 2025

HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding

Jan 25, 2025

Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis

Jan 16, 2025

Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness

Jan 14, 2025

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

Jan 09, 2025

Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models

Oct 25, 2024

DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation

Apr 09, 2024

Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition

Jul 27, 2022

SP-ViT: Learning 2D Spatial Priors for Vision Transformers

Jun 15, 2022

Continual Local Replacement for Few-shot Image Recognition

Jan 23, 2020