Sourish Chaudhuri

AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Jan 05, 2019

Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi (+1 more)

Abstract: Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) that will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human-labeled frames, or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.
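As a quick sanity check on the two size figures quoted in the abstract, the short Python sketch below derives the average labeling rate they imply. Both inputs are the approximate values from the abstract; the function name is illustrative and not part of any dataset tooling.

```python
# Approximate figures quoted in the abstract ("about" 3.65M frames, "about" 38.5 hours).
LABELED_FRAMES = 3.65e6
FACE_TRACK_HOURS = 38.5


def implied_frames_per_second(frames: float, hours: float) -> float:
    """Average number of labeled frames per second of face track."""
    return frames / (hours * 3600.0)


fps = implied_frames_per_second(LABELED_FRAMES, FACE_TRACK_HOURS)
print(f"{fps:.1f} labeled frames per second of face track")
```

The result is roughly 26 frames per second, which is in the range of ordinary video frame rates and suggests the labels are dense rather than sparsely sampled.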

Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers

May 31, 2017

Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, Ian Sturdy

Abstract: In this paper, we present a system that associates faces with voices in a video by fusing information from the audio and visual signals. The thesis underlying our work is that an extremely simple approach to generating (weak) speech clusters can be combined with visual signals to effectively associate faces and voices by aggregating statistics across a video. This approach does not need any training data specific to this task and leverages the natural coherence of information in the audio and visual streams. It is particularly applicable to tracking speakers in videos on the web where a priori information about the environment (e.g., number of speakers, spatial signals for beamforming) is not available. We performed experiments on a real-world dataset using this analysis framework to determine the speaker in a video. Given a ground truth labeling determined by human rater consensus, our approach had ~71% accuracy.
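The idea of aggregating statistics across a video to link speech clusters with faces can be illustrated with a minimal Python sketch. It is not the authors' system: it assumes each time segment has already been given a speech-cluster label and a list of face tracks showing mouth motion (both hypothetical inputs), and it simply counts co-occurrences over the whole video.

```python
from collections import Counter, defaultdict


def associate_clusters_with_faces(segments):
    """Map each speech cluster to the face it co-occurs with most often.

    `segments` is an iterable of (cluster_id, moving_face_ids) pairs, one per
    time segment. Both fields are assumed to come from upstream audio
    clustering and visual analysis, which are not shown here.
    """
    counts = defaultdict(Counter)
    for cluster_id, moving_face_ids in segments:
        for face_id in moving_face_ids:
            counts[cluster_id][face_id] += 1
    # Pick the most frequent face per cluster; clusters never seen with a
    # moving face are left unassigned.
    return {cluster: faces.most_common(1)[0][0] for cluster, faces in counts.items()}


example_segments = [
    ("A", ["face1"]),
    ("A", ["face1", "face2"]),
    ("B", ["face2"]),
    ("B", ["face2"]),
    ("A", []),  # speech heard, but no visible mouth motion
]
assignment = associate_clusters_with_faces(example_segments)
print(assignment)
```

Because the decision is made from counts over the entire video, a single noisy segment (such as the second one above, where two faces move) does not flip the assignment.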

CNN Architectures for Large-Scale Audio Classification

Jan 10, 2017

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold (+3 more)

Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.

* Accepted for publication at ICASSP 2017. Changes: added definitions of mAP, AUC, and d-prime; updated mAP/AUC/d-prime numbers for Audio Set based on the latest Audio Set revision; changed wording to fit the 4-page limit with the new additions.
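For readers unfamiliar with d-prime, the Python sketch below shows one common way to derive it from AUC, under the equal-variance Gaussian assumption: d' = sqrt(2) * Phi^-1(AUC), where Phi^-1 is the inverse standard normal CDF. This is offered as the conventional conversion; the paper's own definition should be checked against the paper itself.

```python
from math import sqrt
from statistics import NormalDist


def d_prime_from_auc(auc: float) -> float:
    """Convert AUC to d-prime assuming equal-variance Gaussian score distributions.

    `auc` must lie strictly between 0 and 1; NormalDist.inv_cdf raises
    StatisticsError otherwise.
    """
    return sqrt(2.0) * NormalDist().inv_cdf(auc)


for auc in (0.5, 0.9, 0.95):
    print(f"AUC={auc:.2f} -> d-prime={d_prime_from_auc(auc):.3f}")
```

A chance-level classifier (AUC of 0.5) maps to a d-prime of zero, and d-prime grows without bound as AUC approaches 1, which spreads out differences between strong classifiers that look nearly identical in AUC.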

Plagiarism Detection in Polyphonic Music using Monaural Signal Separation

Feb 27, 2015

Soham De, Indradyumna Roy, Tarunima Prabhakar, Kriti Suneja, Sourish Chaudhuri, Rita Singh, Bhiksha Raj

Abstract: Given the large number of new musical tracks released each year, automated approaches to plagiarism detection are essential to help us track potential violations of copyright. Most current approaches to plagiarism detection are based on musical similarity measures, which typically ignore the issue of polyphony in music. We present a novel feature space for audio derived from compositional modelling techniques, commonly used in signal separation, that provides a mechanism to account for polyphony without incurring an inordinate amount of computational overhead. We employ this feature representation in conjunction with traditional audio feature representations in a classification framework which uses an ensemble of distance features to characterize pairs of songs as being plagiarized or not. Our experiments on a database of about 3000 musical track pairs show that the new feature space characterization produces significant improvements over standard baselines.

* INTERSPEECH-2012, 1744-1747 (2012)
* Preprint version
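The abstract of this entry describes characterizing a pair of songs by an ensemble of distance features. A rough Python sketch of that general idea follows. The three distances chosen here (Euclidean, Manhattan, cosine) and the assumption that each song is summarized by one fixed-length feature vector are illustrative choices, not details taken from the paper.

```python
from math import sqrt


def distance_features(x, y):
    """Return [euclidean, manhattan, cosine] distances between two feature vectors.

    The resulting list could serve as the input to a pair classifier; the
    choice of distances is an assumption made for this sketch.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    euclidean = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    manhattan = sum(abs(a - b) for a, b in zip(x, y))
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    # Treat a zero vector as maximally dissimilar to avoid dividing by zero.
    cosine = 1.0 - dot / (norm_x * norm_y) if norm_x and norm_y else 1.0
    return [euclidean, manhattan, cosine]


pair_features = distance_features([1.0, 0.0], [0.0, 1.0])
print(pair_features)
```

Using several distances at once lets a downstream classifier weight whichever notion of similarity best separates plagiarized pairs from unrelated ones, instead of committing to a single measure in advance.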
