Julien Epps

Why Pre-trained Models Fail: Feature Entanglement in Multi-modal Depression Detection

Mar 09, 2025

Xiangyu Zhang, Beena Ahmed, Julien Epps

Abstract:Depression remains a pressing global mental health issue, driving considerable research into AI-driven detection approaches. While pre-trained models, particularly speech self-supervised models (SSL Models), have been applied to depression detection, they show unexpectedly poor performance without extensive data augmentation. Large Language Models (LLMs), despite their success across various domains, have not been explored in multi-modal depression detection. In this paper, we first establish an LLM-based system to investigate its potential in this task, uncovering fundamental limitations in handling multi-modal information. Through systematic analysis, we discover that the poor performance of pre-trained models stems from the conflation of high-level information, where high-level features derived from both content and speech are mixed within pre-trained model representations, making it challenging to establish effective decision boundaries. To address this, we propose an information separation framework that disentangles these features, significantly improving the performance of both SSL models and LLMs in depression detection. Our experiments validate this finding and demonstrate that the integration of separated features yields substantial improvements over existing approaches, providing new insights for developing more effective multi-modal depression detection systems.

Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction

Sep 12, 2024

Xiangyu Zhang, Daijiao Liu, Tianyi Xiao, Cihan Xiao, Tuende Szalay, Mostafa Shahin, Beena Ahmed, Julien Epps

Abstract:In the speech signal, acoustic landmarks identify times when the acoustic manifestations of the linguistically motivated distinctive features are most salient. Acoustic landmarks have been widely applied in various domains, including speech recognition, speech depression detection, clinical analysis of speech abnormalities, and the detection of disordered speech. However, there is currently no dataset available that provides precise timing information for landmarks, which has been proven to be crucial for downstream applications involving landmarks. In this paper, we selected the most useful acoustic landmarks based on previous research and annotated the TIMIT dataset with them, based on a combination of phoneme boundary information and manual inspection. Moreover, previous landmark extraction tools were not open source or benchmarked, so to address this, we developed an open source Python-based landmark extraction tool and established a series of landmark detection baselines. The first of their kind, the dataset with precise landmark timing information, the landmark extraction tool, and the baselines are designed to support a wide variety of future research.

Rethinking Mamba in Speech Processing by Self-Supervised Models

Sep 11, 2024

Xiangyu Zhang, Jianbo Ma, Mostafa Shahin, Beena Ahmed, Julien Epps

Abstract:The Mamba-based model has demonstrated outstanding performance across tasks in computer vision, natural language processing, and speech processing. However, in the realm of speech processing, the Mamba-based model's performance varies across different tasks. For instance, in tasks such as speech enhancement and spectrum reconstruction, the Mamba model performs well when used independently. However, for tasks like speech recognition, additional modules are required to surpass the performance of attention-based models. We propose the hypothesis that the Mamba-based model excels in "reconstruction" tasks within speech processing. However, for "classification tasks" such as Speech Recognition, additional modules are necessary to accomplish the "reconstruction" step. To validate our hypothesis, we analyze the previous Mamba-based Speech Models from an information theory perspective. Furthermore, we leveraged the properties of HuBERT in our study. We trained a Mamba-based HuBERT model, and the mutual information patterns, along with the model's performance metrics, confirmed our assumptions.

Mamba in Speech: Towards an Alternative to Self-Attention

May 22, 2024

Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, Julien Epps

Abstract:Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing using two typical speech processing tasks: speech recognition, which requires semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The results exhibit the superiority of bidirectional Mamba (BiMamba) over vanilla Mamba for speech processing. Moreover, experiments demonstrate the effectiveness of BiMamba as an alternative to the self-attention module in Transformer and its derivatives, particularly for the semantic-aware task. The crucial technologies for transferring Mamba to speech are then summarized in ablation studies and the discussion section to offer insights for future research.

When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection

Feb 17, 2024

Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, Julien Epps

Abstract:Depression is a critical concern in global mental health, prompting extensive research into AI-based detection methods. Among various AI technologies, Large Language Models (LLMs) stand out for their versatility in mental healthcare applications. However, their primary limitation arises from their exclusive dependence on textual input, which constrains their overall capabilities. Furthermore, the utilization of LLMs in identifying and analyzing depressive states is still relatively untapped. In this paper, we present an innovative approach to integrating acoustic speech information into the LLMs framework for multimodal depression detection. We investigate an efficient method for depression detection by integrating speech signals into LLMs utilizing Acoustic Landmarks. By incorporating acoustic landmarks, which are specific to the pronunciation of spoken words, our method adds critical dimensions to text transcripts. This integration also provides insights into the unique speech patterns of individuals, revealing the potential mental states of individuals. Evaluations of the proposed approach on the DAIC-WOZ dataset reveal state-of-the-art results when compared with existing Audio-Text baselines. In addition, this approach is not only valuable for the detection of depression but also represents a new perspective in enhancing the ability of LLMs to comprehend and process speech signals.

Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

Nov 13, 2023

Mostafa Shahin, Julien Epps, Beena Ahmed

Abstract:The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods relying on analysing phonemes can only detect categorical errors of phonemes that have an adequate amount of training data to be modelled. With the unpredictable nature of the pronunciation errors of non-native or disordered speakers and the scarcity of training datasets, it is unfeasible to model all types of mispronunciations. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback to the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as a core model for the speech attribute detector. The proposed method was applied to L2 speech corpora collected from English learners from different native languages. The proposed speech attribute MDD method was further compared to the traditional phoneme-level MDD and achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all speech attributes compared to the phoneme-level equivalent.
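
For readers unfamiliar with the FAR and FRR metrics named in the abstract above, here is a minimal sketch of how they are commonly computed. It assumes the usual convention that FAR is the fraction of mispronounced items the system accepts as correct, and FRR is the fraction of correctly pronounced items the system flags as errors; the paper's exact definitions may differ. The function name `far_frr` and its inputs are hypothetical and do not come from the paper.

```python
def far_frr(is_mispronounced, flagged_as_error):
    """Return (FAR, FRR) for two parallel lists of booleans.

    is_mispronounced[i]  -- ground truth: item i was actually mispronounced
    flagged_as_error[i]  -- system output: item i was flagged as an error
    Both names are illustrative assumptions, not taken from the paper.
    """
    pairs = list(zip(is_mispronounced, flagged_as_error))
    # False accept: a real mispronunciation that the system let through.
    false_accepts = sum(1 for truth, flag in pairs if truth and not flag)
    # False reject: a correct pronunciation that the system flagged.
    false_rejects = sum(1 for truth, flag in pairs if not truth and flag)
    n_mispronounced = sum(1 for truth, _ in pairs if truth)
    n_correct = len(pairs) - n_mispronounced
    # Guard against empty classes to avoid division by zero.
    far = false_accepts / n_mispronounced if n_mispronounced else 0.0
    frr = false_rejects / n_correct if n_correct else 0.0
    return far, frr


truth = [True, True, False, False, False, True]
flags = [True, False, False, True, False, False]
far, frr = far_frr(truth, flags)
print(f"FAR={far:.3f} FRR={frr:.3f}")
```

Lower values are better for both rates; the two trade off against each other as the detection threshold moves.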

Speaker- and Age-Invariant Training for Child Acoustic Modeling Using Adversarial Multi-Task Learning

Oct 19, 2022

Mostafa Shahin, Beena Ahmed, Julien Epps

Abstract:One of the major challenges in acoustic modelling of child speech is the rapid changes that occur in the children's articulators as they grow up, their differing growth rates and the subsequent high variability in the same age group. These high acoustic variations along with the scarcity of child speech corpora have impeded the development of a reliable speech recognition system for children. In this paper, a speaker- and age-invariant training approach based on adversarial multi-task learning is proposed. The system consists of one generator shared network that learns to generate speaker- and age-invariant features connected to three discrimination networks, for phoneme, age, and speaker. The generator network is trained to minimize the phoneme-discrimination loss and maximize the speaker- and age-discrimination losses in an adversarial multi-task learning fashion. The generator network is a Time Delay Neural Network (TDNN) architecture while the three discriminators are feed-forward networks. The system was applied to the OGI speech corpora and achieved a 13% reduction in the WER of the ASR.

* Submitted to ICASSP2023
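
The generator objective described in the abstract above (minimize the phoneme-discrimination loss while maximizing the speaker- and age-discrimination losses) can be sketched as a single scalar to minimize. This is only an illustration under stated assumptions: the function name `generator_objective` and the weighting parameter `adv_weight` are hypothetical, and the paper may combine or weight the losses differently.

```python
def generator_objective(phoneme_loss, speaker_loss, age_loss, adv_weight=1.0):
    """Scalar the shared generator would minimize in this sketch.

    Minimizing this value pushes phoneme_loss down while pushing
    speaker_loss and age_loss up, which encourages features that are
    useful for phonemes but uninformative about speaker and age.
    adv_weight is an assumed trade-off hyperparameter.
    """
    return phoneme_loss - adv_weight * (speaker_loss + age_loss)


# With fixed phoneme loss, a more confused speaker discriminator
# (higher speaker_loss) gives a lower, i.e. better, generator objective.
print(generator_objective(2.0, 0.5, 0.25, adv_weight=0.1))
print(generator_objective(2.0, 1.5, 0.25, adv_weight=0.1))
```

In practice, this sign flip is often implemented with a gradient reversal layer between the generator and the adversarial discriminators, so that each discriminator still minimizes its own loss.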

The Ambiguous World of Emotion Representation

Sep 01, 2019

Vidhyasaharan Sethu, Emily Mower Provost, Julien Epps, Carlos Busso, Nicholas Cummins, Shrikanth Narayanan

Abstract:Artificial intelligence and machine learning systems have demonstrated huge improvements and human-level parity in a range of activities, including speech recognition, face recognition and speaker verification. However, these diverse tasks share a key commonality that is not true in affective computing: the ground truth information that is inferred can be unambiguously represented. This observation provides some hints as to why affective computing, despite having attracted the attention of researchers for years, may still not be considered a mature field of research. A key reason for this is the lack of a common mathematical framework to describe all the relevant elements of emotion representations. This paper proposes the AMBiguous Emotion Representation (AMBER) framework to address this deficiency. AMBER is a unified framework that explicitly describes categorical, numerical and ordinal representations of emotions, including time varying representations. In addition to explaining the core elements of AMBER, the paper also discusses how some of the commonly employed emotion representation schemes can be viewed through the AMBER framework, and concludes with a discussion of how the proposed framework can be used to reason about current and future affective computing systems.

Transfer Learning for Improving Speech Emotion Classification Accuracy

Mar 26, 2018

Siddique Latif, Rajib Rana, Shahzad Younis, Junaid Qadir, Julien Epps

Abstract:The majority of existing speech emotion recognition research focuses on automatic emotion detection using training and testing data from the same corpus collected under the same conditions. The performance of such systems has been shown to drop significantly in cross-corpus and cross-language scenarios. To address the problem, this paper exploits a transfer learning technique, novel in cross-language and cross-corpus scenarios, to improve the performance of speech emotion recognition systems. Evaluations on five different corpora in three different languages show that Deep Belief Networks (DBNs) offer better accuracy than previous approaches on cross-corpus emotion recognition, relative to a Sparse Autoencoder and SVM baseline system. Results also suggest that using a large number of languages for training and using a small fraction of the target data in training can significantly boost accuracy compared with the baseline, even for the corpus with limited training examples.

Variational Autoencoders for Learning Latent Representations of Speech Emotion: A Preliminary Study

Mar 26, 2018

Siddique Latif, Rajib Rana, Junaid Qadir, Julien Epps

Abstract:Learning the latent representation of data in an unsupervised fashion is a very interesting process that provides relevant features for enhancing the performance of a classifier. For speech emotion recognition tasks, generating effective features is crucial. Currently, handcrafted features are mostly used for speech emotion recognition; however, features learned automatically using deep learning have shown strong success in many problems, especially in image processing. In particular, deep generative models such as Variational Autoencoders (VAEs) have gained enormous success for generating features for natural images. Inspired by this, we propose VAEs for deriving the latent representation of speech signals and use this representation to classify emotions. To the best of our knowledge, we are the first to propose VAEs for speech emotion classification. Evaluations on the IEMOCAP dataset demonstrate that features learned by VAEs can produce state-of-the-art results for speech emotion classification.

* 4 pages
