Sanjeel Parekh

LTCI

Learning to Highlight Audio by Watching Movies

May 17, 2025

Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh

Abstract:Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: https://wikichao.github.io/VisAH/.

* CVPR 2025. Project page: https://wikichao.github.io/VisAH/
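
The three-step pseudo-data process named in the abstract (separation, adjustment, remixing) can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes the separation step has already produced per-source stems, and the stem names, the gain range in decibels, and the function `make_muddy_mix` are all hypothetical choices made for illustration.

```python
import random

def make_muddy_mix(stems, gain_range_db=(-12.0, 12.0), seed=0):
    """Toy 'adjust and remix' step: rescale each stem by a random gain and sum.

    stems: dict mapping a stem name to a list of float samples (equal lengths).
    Returns (remixed samples, dict of linear gains applied per stem).
    """
    rng = random.Random(seed)
    length = len(next(iter(stems.values())))
    if any(len(s) != length for s in stems.values()):
        raise ValueError("all stems must have the same length")
    gains = {}
    mix = [0.0] * length
    for name, samples in stems.items():
        # Adjustment: draw a gain in dB and convert it to a linear factor.
        gain_db = rng.uniform(*gain_range_db)
        gain = 10.0 ** (gain_db / 20.0)
        gains[name] = gain
        # Remixing: accumulate the rescaled stem into the output mix.
        for i, x in enumerate(samples):
            mix[i] += gain * x
    return mix, gains

# Hypothetical stems standing in for the output of a source separator.
stems = {
    "speech": [0.5, -0.5, 0.25, 0.0],
    "music": [0.1, 0.1, -0.1, -0.1],
    "effects": [0.0, 0.2, 0.0, -0.2],
}
well_mixed = [sum(v) for v in zip(*stems.values())]  # stand-in for the movie mix
muddy, gains = make_muddy_mix(stems)
```

Under these assumptions, `muddy` would serve as the poorly mixed input and `well_mixed` as the target, which is one way to read the "free supervision" the abstract attributes to movie audio.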

Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment

Jan 30, 2025

Joanna Hong, Sanjeel Parekh, Honglie Chen, Jacob Donley, Ke Tan, Buye Xu, Anurag Kumar

Abstract: Building reliable speech systems often requires combining multiple modalities, like audio and visual cues. While such multimodal solutions frequently lead to improvements in performance and may even be critical in certain cases, they come with several constraints such as increased sensory requirements, computational cost, and modality synchronization, to mention a few. These challenges constrain the direct use of these multimodal solutions in real-world applications. In this work, we develop approaches where the learning happens with all available modalities but the deployment or inference is done with just one modality or a reduced set of modalities. To do so, we propose a Multimodal Training and Unimodal Deployment (MUTUD) framework which includes a Temporally Aligned Modality feature Estimation (TAME) module that can estimate information from the missing modality using the modalities present during inference. This innovative approach facilitates the integration of information across different modalities, enhancing the overall inference process by leveraging the strengths of each modality to compensate for the absence of certain modalities during inference. We apply MUTUD to various audiovisual speech tasks and show that it can reduce the performance gap between the multimodal and corresponding unimodal models to a considerable extent. MUTUD can achieve this while reducing the model size and compute compared to multimodal models, in some cases by almost 80%.

Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization

May 11, 2023

Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Gaël Richard, Florence d'Alché-Buc

Abstract:This paper tackles two major problem settings for interpretability of audio processing networks, post-hoc and by-design interpretation. For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. This is extended to present an inherently interpretable model with high performance. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, an interpreter is trained to generate a regularized intermediate embedding from hidden layers of a target network, learnt as time-activations of a pre-learnt NMF dictionary. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision. We demonstrate our method's applicability on a variety of classification tasks, including multi-label data for real-world audio and music.

* Under submission at IEEE/ACM TASLP. arXiv admin note: text overlap with arXiv:2202.11479

Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF

Feb 23, 2022

Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Florence d'Alché-Buc, Gaël Richard

Abstract:This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a carefully regularized interpreter module is trained to take hidden layer representations of the targeted network as input and produce time activations of pre-learnt NMF components as intermediate outputs. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision. We demonstrate our method's applicability on popular benchmarks, including a real-world multi-label classification task.
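
To make "time activations of pre-learnt NMF components" concrete, here is a minimal sketch of non-negative matrix factorization with a fixed dictionary. It is not the paper's interpreter: in the abstract a trained network produces the activations, whereas this sketch swaps in the standard multiplicative update for the Frobenius-norm objective to estimate them directly. The matrices `W`, `H_true`, and `V` and the function names are hypothetical toy values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def nmf_activations(V, W, n_iter=200, eps=1e-9):
    """Estimate non-negative activations H with V ~= W H, keeping W fixed.

    V: (frequency x time) non-negative matrix, e.g. a magnitude spectrogram.
    W: (frequency x components) pre-learnt non-negative dictionary.
    Uses the multiplicative update H <- H * (W^T V) / (W^T W H + eps).
    """
    k, t = len(W[0]), len(V[0])
    H = [[1.0] * t for _ in range(k)]  # positive initialization
    Wt = transpose(W)
    WtV = matmul(Wt, V)
    WtW = matmul(Wt, W)
    for _ in range(n_iter):
        WtWH = matmul(WtW, H)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(t)]
             for i in range(k)]
    return H

# Toy dictionary with two spectral components over three frequency bins.
W = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# Toy ground-truth activations over three time frames.
H_true = [[1.0, 0.5, 2.0], [0.5, 3.0, 1.0]]
V = matmul(W, H_true)
H_est = nmf_activations(V, W)
```

Each row of `H_est` indicates how strongly one dictionary component is active in each time frame. Because every component contributes a non-negative, additive part of `V`, individual components can be resynthesized and listened to, which is the property the abstract relies on for listenable interpretations.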

Emotion Transfer Using Vector-Valued Infinite Task Learning

Feb 09, 2021

Alex Lambert, Sanjeel Parekh, Zoltán Szabó, Florence d'Alché-Buc

Abstract:Style transfer is a significant problem of machine learning with numerous successful applications. In this work, we present a novel style transfer framework building upon infinite task learning and vector-valued reproducing kernel Hilbert spaces. We instantiate the idea in emotion transfer where the goal is to transform facial images to different target emotions. The proposed approach provides a principled way to gain explicit control over the continuous style space. We demonstrate the efficiency of the technique on popular facial emotion benchmarks, achieving low reconstruction cost and high emotion classification accuracy.

* 17 pages, 10 figures

Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision

Nov 09, 2018

Sanjeel Parekh, Alexey Ozerov, Slim Essid, Ngoc Duong, Patrick Pérez, Gaël Richard

Abstract:We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments over a challenging dataset of music instrument performance videos. We also show encouraging visual object localization results.

Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

Jul 09, 2018

Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Patrick Pérez, Gaël Richard

Abstract:Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is trained using only video-level event labels without any timing information. An important feature of our method is its capacity to learn from unsynchronized audio-visual events. We achieve state-of-the-art results on a large-scale dataset of weakly-labeled audio event videos. Visualizations of localized visual regions and audio segments substantiate our system's efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.
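
The multiple instance learning setup described here, where only video-level labels are available, can be sketched as follows. The abstract does not state which aggregation the authors use; this sketch assumes max pooling over instances, which is one common choice. The function `bag_scores` and the example scores are hypothetical.

```python
def bag_scores(instance_scores):
    """Max-pool per-class scores over the instances (segments) of one video.

    instance_scores: list of per-instance lists, one score per class.
    Returns (video-level score per class, index of the top instance per class).
    """
    n_classes = len(instance_scores[0])
    pooled, where = [], []
    for c in range(n_classes):
        column = [inst[c] for inst in instance_scores]
        # The highest-scoring instance both sets the video-level score
        # and serves as a localization of the event within the video.
        best = max(range(len(column)), key=column.__getitem__)
        pooled.append(column[best])
        where.append(best)
    return pooled, where

# Hypothetical scores for three segments of one video and two event classes.
segments = [
    [0.1, 0.2],
    [0.9, 0.1],
    [0.3, 0.7],
]
video_scores, top_segments = bag_scores(segments)
```

Training would compare `video_scores` against the video-level label, so no timing annotation is needed. Since each class picks its own top instance, the evidence for different cues does not have to come from the same segment, which is in the spirit of the unsynchronized setting the abstract highlights.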

Content-based Video Indexing and Retrieval Using Corr-LDA

Feb 27, 2016

Rahul Radhakrishnan Iyer, Sanjeel Parekh, Vikas Mohandoss, Anush Ramsurat, Bhiksha Raj, Rita Singh

Abstract:Existing video indexing and retrieval methods on popular web-based multimedia sharing websites are based on user-provided sparse tagging. This paper proposes a very specific way of searching for video clips, based on the content of the video. We present our work on Content-based Video Indexing and Retrieval using the Correspondence-Latent Dirichlet Allocation (corr-LDA) probabilistic framework. This is a model that provides for auto-annotation of videos in a database with textual descriptors, and brings the added benefit of utilizing the semantic relations between the content of the video and text. We use the concept-level matching provided by corr-LDA to build correspondences between text and multimedia, with the objective of retrieving content with increased accuracy. In our experiments, we employ only the audio components of the individual recordings and compare our results with an SVM-based approach.

* 7 pages
