Joon Son Chung

Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes

Mar 24, 2025

Deep Understanding of Sign Language for Sign to Subtitle Alignment

Mar 05, 2025

LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport

Jan 16, 2025

AdaptVC: High Quality Voice Conversion with Adaptive Learning

Jan 07, 2025

CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation

Dec 28, 2024

VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

Dec 26, 2024

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

Nov 29, 2024

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Oct 23, 2024

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding

Oct 17, 2024

Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding

Oct 17, 2024