Yapeng Tian

Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach
Nov 26, 2024

CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs
Nov 19, 2024

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Nov 15, 2024

SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering
Nov 07, 2024

Continual Audio-Visual Sound Separation
Nov 05, 2024

Scaling Concept With Text-Guided Diffusion Models
Oct 31, 2024

CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP
Oct 30, 2024

Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models
Oct 15, 2024

Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation
Oct 09, 2024

DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures
Sep 11, 2024