Abstract: This paper introduces mmWave-Whisper, a system that demonstrates the feasibility of full-corpus automated speech recognition (ASR) on phone calls eavesdropped remotely using off-the-shelf frequency-modulated continuous-wave (FMCW) millimeter-wave radars. Operating in the 77-81 GHz range, mmWave-Whisper captures earpiece vibrations from smartphones, converts them into audio, and processes the audio to produce speech transcriptions automatically. Unlike previous work, which focused on loudspeakers or limited vocabularies, this is the first work to perform large-vocabulary, full-sentence speech recognition on earpiece vibrations from smartphones, expanding the potential of radar-based audio eavesdropping. mmWave-Whisper addresses challenges such as the lack of large-scale training datasets, low SNR, and limited frequency information in radar data through a systematic pipeline designed to leverage synthetic training data, domain adaptation, and inference by incorporating OpenAI's Whisper automatic speech recognition model. The system achieves a word accuracy rate of 44.74% and a character accuracy rate of 62.52% at distances from 25 cm to 125 cm. The paper highlights emerging misuse modalities of AI as the technology evolves rapidly.
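The abstract reports word and character accuracy rates but does not define them. A minimal sketch of one common convention follows, assuming accuracy is one minus the edit-distance error rate (WER or CER), floored at zero. The function names are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref_tokens, hyp_tokens):
    """Assumed definition: 1 - (edit distance / reference length), floored at 0."""
    if not ref_tokens:
        return 1.0 if not hyp_tokens else 0.0
    error_rate = edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
    return max(0.0, 1.0 - error_rate)


def word_accuracy(reference, hypothesis):
    # Tokens are whitespace-separated words.
    return accuracy(reference.split(), hypothesis.split())


def char_accuracy(reference, hypothesis):
    # Tokens are individual characters, spaces included.
    return accuracy(list(reference), list(hypothesis))
```

Under this convention a transcript with one wrong letter in a five-word sentence loses a full word at the word level but only one character at the character level, which is consistent with the character accuracy rate being the higher of the two reported figures.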
Abstract: Deep neural networks (DNNs) have achieved tremendous success in various applications, including video action recognition, yet remain vulnerable to backdoor attacks (Trojans). A backdoor-compromised model misclassifies a test instance from a non-target class to the attacker's chosen target class when the instance is embedded with a specific trigger, while maintaining high accuracy on attack-free instances. Although backdoor attacks against image data have been studied extensively, the susceptibility of video-based systems to backdoor attacks remains largely unexplored. Current studies are direct extensions of approaches proposed for image data, e.g., the triggers are embedded independently within individual frames, which tends to make them detectable by existing defenses. In this paper, we introduce a simple yet effective backdoor attack against video data. Our proposed attack adds perturbations in a transformed domain to plant an imperceptible, temporally distributed trigger across the video frames, and is shown to be resilient to existing defensive strategies. The effectiveness of the proposed attack is demonstrated by extensive experiments with various well-known models on two video recognition benchmarks, UCF101 and HMDB51, and a sign language recognition benchmark, the Greek Sign Language (GSL) dataset. We examine the impact of several influential factors on our proposed attack and identify an intriguing effect, termed "collateral damage", through extensive studies.
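The abstract does not say which transformed domain the attack uses. As a rough illustration of the general idea only, the sketch below assumes a discrete Fourier transform taken along the time axis of each pixel and perturbs a single temporal frequency bin, which spreads a low-amplitude cosine across the frames. The function names, the video layout (a list of frames, each a flat list of intensities in [0, 1]), and the parameters `freq_index` and `strength` are all hypothetical and are not the authors' method.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(c * cmath.exp(2j * math.pi * k * t / n)
                for k, c in enumerate(spectrum)).real / n
            for t in range(n)]


def embed_temporal_trigger(frames, freq_index=2, strength=0.02):
    """Add a trigger in the temporal-frequency domain of every pixel.

    frames: list of T frames, each a flat list of intensities in [0, 1].
    The per-pixel change over time is strength * cos(2*pi*freq_index*t/T),
    before clipping to [0, 1].
    """
    n = len(frames)
    if not 0 < freq_index < n / 2:
        raise ValueError("freq_index must satisfy 0 < freq_index < T/2")
    poisoned = [list(frame) for frame in frames]
    bump = strength * n / 2
    for p in range(len(frames[0])):
        spectrum = dft([frame[p] for frame in frames])
        spectrum[freq_index] += bump
        spectrum[n - freq_index] += bump  # mirror bin keeps the signal real
        for t, value in enumerate(idft(spectrum)):
            poisoned[t][p] = min(1.0, max(0.0, value))
    return poisoned
```

Because the perturbation lives in one temporal frequency bin, no single frame carries a self-contained pattern, in contrast with the frame-independent triggers the abstract describes as easier to detect.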