Wei-Ning Hsu

Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation

Nov 07, 2024

MusicFlow: Cascaded Flow Matching for Text Guided Music Generation

Oct 27, 2024

Movie Gen: A Cast of Media Foundation Models

Oct 17, 2024

High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching

Jul 04, 2024

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Jun 13, 2024

Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

Jun 10, 2024

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Apr 16, 2024

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

Mar 21, 2024

Audiobox: Unified Audio Generation with Natural Language Prompts

Dec 25, 2023

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

Nov 05, 2023