Title: Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

Aug 22, 2024

Luyao Cheng, Hui Wang, Siqi Zheng, Yafeng Chen, Rongjie Huang, Qinglin Zhang, Qian Chen, Xihao Li

Abstract: Speaker diarization, the process of segmenting an audio stream or transcribed speech content into homogeneous partitions based on speaker identity, plays a crucial role in the interpretation and analysis of human speech. Most existing speaker diarization systems rely exclusively on unimodal acoustic information, making the task particularly challenging due to the innate ambiguities of audio signals. Recent studies have made tremendous efforts towards audio-visual or audio-semantic modeling to enhance performance. However, even the incorporation of up to two modalities often falls short in addressing the complexities of spontaneous and unstructured conversations. To exploit more meaningful dialogue patterns, we propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization. Our method elegantly formulates the multimodal modeling as a constrained optimization problem. First, we build insights into the visual connections among active speakers and the semantic interactions within spoken content, thereby establishing abundant pairwise constraints. Then we introduce a joint pairwise constraint propagation algorithm to cluster speakers based on these visual and semantic constraints. This integration effectively leverages the complementary strengths of different modalities, refining the affinity estimation between individual speaker embeddings. Extensive experiments conducted on multiple multimodal datasets demonstrate that our approach consistently outperforms state-of-the-art speaker diarization methods.
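The abstract does not spell out the propagation algorithm, so the following is only a minimal sketch of the general idea of refining a speaker-embedding affinity matrix with pairwise constraints. It assumes a generic closed-form constraint propagation in the style of the published "exhaustive and efficient constraint propagation" technique, not the authors' joint algorithm. The function names, the `alpha` parameter, and the toy embeddings are all hypothetical, and the must-link and cannot-link pairs stand in for whatever the visual and semantic modules would supply.

```python
import numpy as np

def cosine_affinity(emb):
    """Cosine similarity between embeddings, rescaled from [-1, 1] to [0, 1]."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.clip((unit @ unit.T + 1.0) / 2.0, 0.0, 1.0)

def propagate_constraints(W, Z, alpha=0.8):
    """Spread sparse pairwise constraints Z over the affinity graph W.

    Z[i, j] = +1 for a must-link pair, -1 for a cannot-link pair, 0 if unknown.
    Assumed closed form: F = (1 - alpha)^2 (I - alpha S)^-1 Z (I - alpha S)^-1,
    where S is the symmetrically normalized affinity matrix.
    """
    W0 = W.copy()
    np.fill_diagonal(W0, 0.0)                      # ignore self-similarity
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W0.sum(axis=1), 1e-12))
    S = W0 * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    n = W.shape[0]
    P = (1.0 - alpha) * np.linalg.inv(np.eye(n) - alpha * S)
    # Clip as a safeguard so the scores can be used as confidences in [-1, 1].
    return np.clip(P @ Z @ P.T, -1.0, 1.0)

def refine_affinity(W, F):
    """Raise affinities where F > 0 and lower them where F < 0."""
    return np.where(F >= 0, 1.0 - (1.0 - F) * (1.0 - W), (1.0 + F) * W)

# Toy data: four segment embeddings (hypothetical values).
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
W = cosine_affinity(emb)

# Hypothetical constraints: segments 0 and 2 are the same speaker
# (e.g. a visual cue), segments 0 and 1 are different speakers
# (e.g. a semantic cue such as a turn change).
Z = np.zeros((4, 4))
Z[0, 2] = Z[2, 0] = 1.0
Z[0, 1] = Z[1, 0] = -1.0

W_refined = refine_affinity(W, propagate_constraints(W, Z))
print(np.round(W_refined, 3))
```

The refined matrix could then be passed to any affinity-based clustering step, such as spectral clustering, in place of the purely acoustic affinities.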
