Abstract:This paper presents a novel streaming end-to-end target-speaker speech recognition that addresses two critical limitations in systems: the handling of noisy enrollment utterances and specific enrollment phrase requirements. This paper proposes a robust Target-Speaker Recurrent Neural Network Transducer (TS-RNNT) with dual attention mechanisms for contextual biasing and overlapping enrollment processing. The model incorporates a text decoder and attention mechanism specifically designed to extract relevant speaker characteristics from noisy, overlapping enrollment audio. Experimental results on a synthesized dataset demonstrate the model's resilience, maintaining a Word Error Rate (WER) of 16.44% even with overlapping enrollment at 5dB Signal-to-Interference Ratio (SIR), compared to conventional approaches that degrade to WERs above 75% under similar conditions. This significant performance improvement, coupled with the model's semi-text-dependent enrollment capabilities, represents a substantial advancement toward more practical and versatile voice-controlled devices.
Abstract:In this paper, we investigate a new machine learning-based transmission strategy called progressive transmission or ProgTr. In ProgTr, there are b variables that should be transmitted using at most T channel uses. The transmitter aims to send the data to the receiver as fast as possible and with as few channel uses as possible (as channel conditions permit) while the receiver refines its estimate after each channel use. We use recurrent neural networks as the building block of both the transmitter and receiver where the SNR is provided as an input that represents the channel conditions. To show how ProgTr works, the proposed scheme was simulated in different scenarios including single/multi-user settings, different channel conditions, and for both discrete and continuous input data. The results show that ProgTr can achieve better performance compared to conventional modulation methods. In addition to performance metrics such as BER, bit-wise mutual information is used to provide some interpretation to how the transmitter and receiver operate in ProgTr.
Abstract:Knowledge of the channel state information (CSI) at the transmitter side is one of the primary sources of information that can be used for efficient allocation of wireless resources. Obtaining Down-Link (DL) CSI in FDD systems from Up-Link (UL) CSI is not as straightforward as TDD systems, and so usually users feedback the DL-CSI to the transmitter. To remove the need for feedback (and thus having less signaling overhead), several methods have been studied to estimate DL-CSI from UL-CSI. In this paper, we propose a scheme to infer DL-CSI by observing UL-CSI in which we use two recent deep neural network structures: a) Convolutional Neural network and b) Generative Adversarial Networks. The proposed deep network structures are first learning a latent model of the environment from the training data. Then, the resulted latent model is used to predict the DL-CSI from the UL-CSI. We have simulated the proposed scheme and evaluated its performance in a few network settings.