Xize Cheng

A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter

Dec 12, 2024

WavChat: A Survey of Spoken Dialogue Models

Nov 26, 2024

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

Oct 28, 2024

MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

Oct 16, 2024

SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Sep 05, 2024

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Aug 29, 2024

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Aug 03, 2024

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Jul 16, 2024

ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

Jun 25, 2024

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

Jun 03, 2024