Zhijie Yan

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Dec 13, 2024

Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation

Oct 17, 2024

Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study

Sep 26, 2024

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Jul 09, 2024

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

Mar 28, 2024

Large Language Models Powered Context-aware Motion Prediction

Mar 17, 2024

Advancing VAD Systems Based on Multi-Task Learning with Improved Model Structures

Dec 19, 2023

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Nov 14, 2023

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Oct 11, 2023

The second multi-channel multi-party meeting transcription challenge (M2MeT 2.0): A benchmark for speaker-attributed ASR

Sep 24, 2023