Abstract:Personal Voice Activity Detection (PVAD) is crucial for identifying target speaker segments in the mixture, yet its performance heavily depends on the quality of speaker embeddings. A key practical limitation is the short enrollment speech--such as a wake-up word--which provides limited cues. This paper proposes a novel adaptive speaker embedding self-augmentation strategy that enhances PVAD performance by augmenting the original enrollment embeddings through additive fusion of keyframe embeddings extracted from mixed speech. Furthermore, we introduce a long-term adaptation strategy to iteratively refine embeddings during detection, mitigating speaker temporal variability. Experiments show significant gains in recall, precision, and F1-score under short enrollment conditions, matching full-length enrollment performance after five iterative updates. The source code is available at https://anonymous.4open.science/r/ASE-PVAD-E5D6 .



Abstract:Speech enhancement (SE) aims to improve the clarity, intelligibility, and quality of speech signals for various speech enabled applications. However, air-conducted (AC) speech is highly susceptible to ambient noise, particularly in low signal-to-noise ratio (SNR) and non-stationary noise environments. Incorporating multi-modal information has shown promise in enhancing speech in such challenging scenarios. Electromyography (EMG) signals, which capture muscle activity during speech production, offer noise-resistant properties beneficial for SE in adverse conditions. Most previous EMG-based SE methods required 35 EMG channels, limiting their practicality. To address this, we propose a novel method that considers only 8-channel EMG signals with acoustic signals using a modified SEMamba network with added cross-modality modules. Our experiments demonstrate substantial improvements in speech quality and intelligibility over traditional approaches, especially in extremely low SNR settings. Notably, compared to the SE (AC) approach, our method achieves a significant PESQ gain of 0.235 under matched low SNR conditions and 0.527 under mismatched conditions, highlighting its robustness.