
Heqing Zou

Xiao Jie

HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

Jan 03, 2025

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

Sep 27, 2024

Text-based Talking Video Editing with Cascaded Conditional Diffusion

Jul 20, 2024

MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition

Jun 18, 2023

Towards Balanced Active Learning for Multimodal Classification

Jun 14, 2023

Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

May 16, 2023

UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

May 16, 2023

Unsupervised Noise Adaptation Using Data Simulation

Feb 23, 2023

Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

Feb 22, 2023

Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

Dec 10, 2022