Abstract:Continual learning for automatic speech recognition (ASR) systems poses a challenge, especially with the need to avoid catastrophic forgetting while maintaining performance on previously learned tasks. This paper introduces a novel approach leveraging the machine speech chain framework to enable continual learning in ASR using gradient episodic memory (GEM). By incorporating a text-to-speech (TTS) component within the machine speech chain, we support the replay mechanism essential for GEM, allowing the ASR model to learn new tasks sequentially without significant performance degradation on earlier tasks. Our experiments, conducted on the LJ Speech dataset, demonstrate that our method outperforms traditional fine-tuning and multitask learning approaches, achieving a substantial error rate reduction while maintaining high performance across varying noise conditions. We showed the potential of our semi-supervised machine speech chain approach for effective and efficient continual learning in speech recognition.
Abstract:Audio-based pornographic detection enables efficient adult content filtering without sacrificing performance by exploiting distinct spectral characteristics. To improve it, we explore pornographic sound modeling based on different neural architectures and acoustic features. We find that CNN trained on log mel spectrogram achieves the best performance on Pornography-800 dataset. Our experiment results also show that log mel spectrogram allows better representations for the models to recognize pornographic sounds. Finally, to classify whole audio waveforms rather than segments, we employ voting segment-to-audio technique that yields the best audio-level detection results.