Abstract: Brain-to-speech decoding models demonstrate robust performance on vocalized, mimed, and imagined speech, yet the mechanisms by which these models capture and transfer information across speech modalities remain underexplored. In this work, we use mechanistic interpretability to causally investigate the internal representations of a neural speech decoder. We perform cross-mode activation patching of internal activations, and use tri-modal interpolation to examine whether speech representations vary discretely or continuously across modes. We apply coarse-to-fine causal tracing and causal scrubbing to localize causal structure, identifying internal subspaces that are sufficient for cross-mode transfer. To determine how finely these effects are distributed within layers, we perform neuron-level activation patching and find that small subsets of neurons, rather than isolated units or broadly distributed populations, mediate cross-mode transfer. Our results show that speech modes lie on a shared continuous causal manifold and that cross-mode transfer is mediated by compact, layer-specific subspaces rather than diffuse activity. Together, these findings give a causal account of how speech-modality information is organized and used in brain-to-speech decoding models, revealing hierarchical and direction-dependent representational structure across speech modes.
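The abstract's central intervention, cross-mode activation patching, can be summarized in a few lines of code. Below is a minimal sketch assuming a generic PyTorch decoder with named submodules; the layer path, input tensors, and effect metric are hypothetical placeholders, not the paper's actual model or measurements.

```python
# Minimal sketch of cross-mode activation patching (assumed PyTorch decoder).
# The hooked layer is assumed to return a single tensor of fixed shape.
import torch

def get_activation(model, layer_name, x):
    """Run a forward pass on a source-mode input and cache one layer's output."""
    cache = {}
    layer = dict(model.named_modules())[layer_name]

    def save_hook(mod, inp, out):
        cache["act"] = out.detach()

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(x)
    handle.remove()
    return cache["act"]

def patch_activation(model, layer_name, x_target, act_source):
    """Forward pass on the target-mode input with the source-mode
    activation spliced in at the chosen layer."""
    layer = dict(model.named_modules())[layer_name]
    # Returning a value from a forward hook replaces the layer's output.
    handle = layer.register_forward_hook(lambda mod, inp, out: act_source)
    with torch.no_grad():
        patched_out = model(x_target)
    handle.remove()
    return patched_out

# Hypothetical usage: patch an "imagined speech" activation into a
# "vocalized" trial and measure how much the decoder output shifts.
# act = get_activation(model, "encoder.layers.3", x_imagined)
# out_patched = patch_activation(model, "encoder.layers.3", x_vocalized, act)
# out_clean = model(x_vocalized)
# effect = torch.norm(out_patched - out_clean)  # size of the causal effect
```

The same hook mechanism extends naturally to the neuron-level patching the abstract describes: instead of replacing the whole layer output, the hook would overwrite only a chosen subset of channels.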
Abstract: Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present SING, a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing datasets of microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk from the LibriSpeech corpus. The spatial information is fused with linguistic embeddings from OpenAI's Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model and fine-tuned with the lightweight adaptation technique LoRA to optimize for on-device processing. SING supports spatially aware automatic speech recognition (ASR), achieving a mean DoA error of $25.72^\circ$, a substantial improvement over the $88.52^\circ$ median error in existing work, with a word error rate (WER) of 5.3. SING also supports soundscaping, for example inferring how many people are talking and their directions, for up to 5 speakers with a median DoA error of $16^\circ$. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.
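The fusion step the abstract describes, combining a DoA estimate with Whisper encoder features and projecting into the LLM's input space, can be sketched as a small PyTorch module. The dimensions, module names, and fusion strategy (per-frame concatenation followed by a linear projection) below are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of DoA + Whisper feature fusion into an LLM embedding space.
# whisper_dim=768 assumes a Whisper-small encoder; llm_dim=3072 assumes the
# LLaMA-3.2 3B hidden size. Both are stand-in values.
import torch
import torch.nn as nn

class SpatialSpeechFusion(nn.Module):
    def __init__(self, whisper_dim=768, doa_dim=64, llm_dim=3072):
        super().__init__()
        # Encode the scalar DoA angle (as sin/cos) into a learned vector.
        self.doa_encoder = nn.Sequential(
            nn.Linear(2, doa_dim), nn.GELU(), nn.Linear(doa_dim, doa_dim)
        )
        # Project concatenated speech + spatial features to the LLM width.
        self.proj = nn.Linear(whisper_dim + doa_dim, llm_dim)

    def forward(self, whisper_feats, doa_rad):
        # whisper_feats: (batch, frames, whisper_dim) encoder outputs
        # doa_rad: (batch,) direction-of-arrival angle in radians
        doa_vec = torch.stack([torch.sin(doa_rad), torch.cos(doa_rad)], dim=-1)
        doa_emb = self.doa_encoder(doa_vec)                      # (batch, doa_dim)
        doa_emb = doa_emb.unsqueeze(1).expand(-1, whisper_feats.size(1), -1)
        fused = torch.cat([whisper_feats, doa_emb], dim=-1)      # per-frame fusion
        return self.proj(fused)                                  # (batch, frames, llm_dim)

# Hypothetical usage: the fused sequence would be prepended to the LLM's token
# embeddings, with LoRA adapters (e.g., via the `peft` library) applied to the
# LLM for lightweight fine-tuning.
fusion = SpatialSpeechFusion()
feats = torch.randn(2, 100, 768)       # stand-in for Whisper encoder output
angles = torch.tensor([0.5, 2.1])      # stand-in DoA estimates in radians
llm_inputs = fusion(feats, angles)     # (2, 100, 3072)
```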