Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ian McLoughlin

Automated evaluation of children's speech fluency for low-resource languages

May 26, 2025

Bowen Zhang, Nur Afiqah Abdul Latiff, Justin Kan, Rong Tong, Donny Soh, Xiaoxiao Miao, Ian McLoughlin

Abstract:Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extraction stage, and a generative pre-trained transformer (GPT) network. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT-based classifier guided by a small set of human-evaluated ground truth examples, to score fluency. We evaluate the proposed system on a dataset of children's speech in two low-resource languages, Tamil and Malay and compare the classification performance against Random Forest and XGBoost, as well as using ChatGPT-4o to predict fluency directly from speech input. Results demonstrate that the proposed approach achieves significantly higher accuracy than multimodal GPT or other methods.

* 5 pages, 2 figures, conference

Via

Access Paper or Ask Questions

On-Device LLMs for SMEs: Challenges and Opportunities

Oct 21, 2024

Jeremy Stephen Gabriel Yee Zhi Wen, Pai Chet Ng, Zhengkui Wang, Ian McLoughlin, Aik Beng Ng, Simon See

Abstract:This paper presents a systematic review of the infrastructure requirements for deploying Large Language Models (LLMs) on-device within the context of small and medium-sized enterprises (SMEs), focusing on both hardware and software perspectives. From the hardware viewpoint, we discuss the utilization of processing units like GPUs and TPUs, efficient memory and storage solutions, and strategies for effective deployment, addressing the challenges of limited computational resources typical in SME settings. From the software perspective, we explore framework compatibility, operating system optimization, and the use of specialized libraries tailored for resource-constrained environments. The review is structured to first identify the unique challenges faced by SMEs in deploying LLMs on-device, followed by an exploration of the opportunities that both hardware innovations and software adaptations offer to overcome these obstacles. Such a structured review provides practical insights, contributing significantly to the community by enhancing the technological resilience of SMEs in integrating LLMs.

* 9 pages, 1 figure. The work is supported by the SIT-NVIDIA Joint AI Centre

Via

Access Paper or Ask Questions

Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection

Sep 26, 2024

Pengfei Cai, Yan Song, Nan Jiang, Qing Gu, Ian McLoughlin

Abstract:A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model~(PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5\%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique.

* Submitted to ICASSP2025; The code for this paper will be available at https://github.com/cai525/Transformer4SED after the paper is accepted

Via

Access Paper or Ask Questions

MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

Aug 19, 2024

Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin

Abstract:Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluation of MAT-SED on DCASE2023 task4 surpasses state-of-the-art performance, achieving 0.587/0.896 PSDS1/PSDS2 respectively.

* Received by interspeech 2024

Via

Access Paper or Ask Questions

MAT-SED: AMasked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

Aug 16, 2024

Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin

* Received by interspeech 2024

Via

Access Paper or Ask Questions

Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection

May 20, 2023

Xiao-Min Zeng, Yan Song, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, Li-Rong Dai, Ian McLoughlin

Figure 1 for Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection

Figure 2 for Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection

Figure 3 for Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection

Abstract:In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to perform frame-level prediction. The output of the PAE together with original normal samples, are used for supervised contrastive representative learning in a multi-task framework. Besides cross-entropy loss between classes, contrastive loss is used to separate PAE output and original samples within each class. GeCo aims to better capture context information among frames, thanks to the self-attention mechanism for PAE model. Furthermore, GeCo combines generative and contrastive learning from which we aim to yield more effective and informative representations, compared to existing methods. Extensive experiments have been conducted on the DCASE2020 Task2 development dataset, showing that GeCo outperforms state-of-the-art generative and discriminative methods.

* Accepted by ICASSP2023

Via

Access Paper or Ask Questions

AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

Mar 07, 2023

Kang Li, Yan Song, Li-Rong Dai, Ian McLoughlin, Xin Fang, Lin Liu

Figure 1 for AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

Figure 2 for AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

Figure 3 for AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

Figure 4 for AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

Abstract:In this paper, we propose an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model, pretrained on the large-scale AudioSet for audio tagging (AT) task, termed AST-SED. Pretrained AST models have recently shown promise on DCASE2022 challenge task4 where they help mitigate a lack of sufficient real annotated data. However, mainly due to differences between the AT and SED tasks, it is suboptimal to directly utilize outputs from a pretrained AST model. Hence the proposed AST-SED adopts an encoder-decoder architecture to enable effective and efficient fine-tuning without needing to redesign or retrain the AST model. Specifically, the Frequency-wise Transformer Encoder (FTE) consists of transformers with self attention along the frequency axis to address multiple overlapped audio events issue in a single clip. The Local Gated Recurrent Units Decoder (LGD) consists of nearest-neighbor interpolation (NNI) and Bidirectional Gated Recurrent Units (Bi-GRU) to compensate for temporal resolution loss in the pretrained AST model output. Experimental results on DCASE2022 task4 development set have demonstrated the superiority of the proposed AST-SED with FTE-LGD architecture. Specifically, the Event-Based F1-score (EB-F1) of 59.60% and Polyphonic Sound detection Score scenario1 (PSDS1) score of 0.5140 significantly outperform CRNN and other pretrained AST-based systems.

* accepted to ICASSP 2023

Via

Access Paper or Ask Questions

A Light-weight Deep Learning Model for Remote Sensing Image Classification

Feb 25, 2023

Lam Pham, Cam Le, Dat Ngo, Anh Nguyen, Jasmin Lampert, Alexander Schindler, Ian McLoughlin

Figure 1 for A Light-weight Deep Learning Model for Remote Sensing Image Classification

Figure 2 for A Light-weight Deep Learning Model for Remote Sensing Image Classification

Figure 3 for A Light-weight Deep Learning Model for Remote Sensing Image Classification

Figure 4 for A Light-weight Deep Learning Model for Remote Sensing Image Classification

Abstract:In this paper, we present a high-performance and light-weight deep learning model for Remote Sensing Image Classification (RSIC), the task of identifying the aerial scene of a remote sensing image. To this end, we first valuate various benchmark convolutional neural network (CNN) architectures: MobileNet V1/V2, ResNet 50/151V2, InceptionV3/InceptionResNetV2, EfficientNet B0/B7, DenseNet 121/201, ConNeXt Tiny/Large. Then, the best performing models are selected to train a compact model in a teacher-student arrangement. The knowledge distillation from the teacher aims to achieve high performance with significantly reduced complexity. By conducting extensive experiments on the NWPU-RESISC45 benchmark, our proposed teacher-student models outperforms the state-of-the-art systems, and has potential to be applied on a wide rage of edge devices.

Via

Access Paper or Ask Questions

Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Jun 20, 2022

Zhifu Gao, Shiliang Zhang, Ian McLoughlin, Zhijie Yan

Figure 1 for Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Figure 2 for Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Figure 3 for Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Figure 4 for Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Abstract:Transformers have recently dominated the ASR field. Although able to yield good performance, they involve an autoregressive (AR) decoder to generate tokens one by one, which is computationally inefficient. To speed up inference, non-autoregressive (NAR) methods, e.g. single-step NAR, were designed, to enable parallel generation. However, due to an independence assumption within the output tokens, performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus. There are two challenges to improving single-step NAR: Firstly to accurately predict the number of output tokens and extract hidden variables; secondly, to enhance modeling of interdependence between output tokens. To tackle both challenges, we propose a fast and accurate parallel transformer, termed Paraformer. This utilizes a continuous integrate-and-fire based predictor to predict the number of tokens and generate hidden variables. A glancing language model (GLM) sampler then generates semantic embeddings to enhance the NAR decoder's ability to model context interdependence. Finally, we design a strategy to generate negative samples for minimum word error rate training to further improve performance. Experiments using the public AISHELL-1, AISHELL-2 benchmark, and an industrial-level 20,000 hour task demonstrate that the proposed Paraformer can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.

* 5 pages, 3 figures, accepted by INTERSPEECH 2022

Via

Access Paper or Ask Questions

An Ensemble of Deep Learning Frameworks Applied For Predicting Respiratory Anomalies

Jan 09, 2022

Lam Pham, Dat Ngo, Truong Hoang, Alexander Schindler, Ian McLoughlin

Figure 1 for An Ensemble of Deep Learning Frameworks Applied For Predicting Respiratory Anomalies

Figure 2 for An Ensemble of Deep Learning Frameworks Applied For Predicting Respiratory Anomalies

Figure 3 for An Ensemble of Deep Learning Frameworks Applied For Predicting Respiratory Anomalies

Figure 4 for An Ensemble of Deep Learning Frameworks Applied For Predicting Respiratory Anomalies

Abstract:In this paper, we evaluate various deep learning frameworks for detecting respiratory anomalies from input audio recordings. To this end, we firstly transform audio respiratory cycles collected from patients into spectrograms where both temporal and spectral features are presented, referred to as the front-end feature extraction. We then feed the spectrograms into back-end deep learning networks for classifying these respiratory cycles into certain categories. Finally, results from high-performed deep learning frameworks are fused to obtain the best score. Our experiments on ICBHI benchmark dataset achieve the highest ICBHI score of 57.3 from a late fusion of inception based and transfer learning based deep learning frameworks, which outperforms the state-of-the-art systems.

Via

Access Paper or Ask Questions