Heqing Zou

HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

Jan 03, 2025

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

Sep 27, 2024

Text-based Talking Video Editing with Cascaded Conditional Diffusion

Jul 20, 2024

MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition

Jun 18, 2023

Towards Balanced Active Learning for Multimodal Classification

Jun 14, 2023

UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

May 16, 2023

Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

May 16, 2023

Unsupervised Noise Adaptation Using Data Simulation

Feb 23, 2023

Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

Feb 22, 2023

Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

Dec 10, 2022