Abstract: Transmission line (TL) models implemented in the time domain can efficiently simulate basilar-membrane (BM) displacement in response to transient or non-stationary sounds. By design, a TL model is well suited to a one-dimensional (1-D) characterization of the traveling wave, but the real geometry of the cochlea also introduces higher-dimensional effects. Such effects include the focusing of pressure around the BM and transverse viscous damping, both of which are magnified in the short-wave region. The two effects depend on the wavelength and are more readily expressed in the frequency domain. In this paper, we introduce a numerical correction to the BM admittance that accounts for 2-D effects in the time domain using autoregressive filtering and regression techniques. The correction was required to implement a TL model tailored to gerbil cochlear physiology. The model, which includes instantaneous nonlinearities in the form of variable damping, initially exhibited insufficient compression with increasing sound levels. This limitation was explained by the strong coupling between gain and frequency selectivity assumed in the 1-D nonlinear TL model, whereas cochlear frequency selectivity shows only a moderate dependence on sound level in small mammals. The correction factor was implemented in the gerbil model and made level-dependent through a feedback loop. The updated model achieved some decoupling between frequency selectivity and gain, providing 5 dB of additional gain and extending the range of sound levels covered by the compressive regime by 10 dB. We discuss the relevance of this work through two key features: the integration of analytical and regression methods for characterizing the BM admittance, and the combination of instantaneous and non-instantaneous nonlinearities.
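To make the filter-design step concrete, the sketch below shows one way a frequency-domain correction factor could be folded into a time-domain autoregressive filter: the target response is sampled, inverted to an impulse response, and an all-pole model is fitted by least-squares regression. This is a generic illustration, not the paper's implementation; the correction function `short_wave_correction`, the sampling rate, the FFT length, and the AR order are all hypothetical placeholders.

```python
# Minimal sketch (assumptions labeled below): convert a frequency-domain
# correction factor into a time-domain autoregressive (all-pole) filter
# by least-squares regression on its impulse response.
import numpy as np
from scipy.signal import lfilter

fs = 100_000           # sampling rate in Hz (assumed, not from the paper)
n_fft = 1024           # FFT length used to sample the target response (assumed)
p = 8                  # autoregressive model order (assumed)

def short_wave_correction(f, f_cf=8_000.0):
    """Hypothetical smooth correction factor peaking near a characteristic
    frequency f_cf, standing in for the wavelength-dependent 2-D term."""
    return 1.0 + 0.5 / (1.0 + ((f - f_cf) / 2_000.0) ** 2)

# Sample the target correction on the positive-frequency grid; a real
# spectrum yields a real (here zero-phase) impulse response. In practice
# one would enforce a causal/minimum-phase target; kept simple here.
f = np.fft.rfftfreq(n_fft, d=1.0 / fs)
H = short_wave_correction(f)
h = np.fft.irfft(H, n=n_fft)          # target impulse response

# Fit h[n] ~ -sum_k a_k h[n-k] for n >= p by linear least squares
# (covariance-method AR fit). Column k holds the lag-k samples h[n-k].
rows = np.column_stack([h[p - k : n_fft - k] for k in range(1, p + 1)])
a_tail = np.linalg.lstsq(rows, -h[p:], rcond=None)[0]
a = np.concatenate(([1.0], a_tail))   # denominator (AR) coefficients
b = np.array([h[0]])                  # zeroth-order numerator matches h[0]

# Apply the fitted correction to a time-domain signal. Stability of the
# fitted poles should be checked before use in a feedback loop.
x = np.random.randn(4096)             # stand-in for a BM drive signal
y = lfilter(b, a, x)                  # corrected output
```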
Abstract: Languages have long been described according to their perceived rhythmic attributes. The associated typologies are of interest in psycholinguistics as they partly predict newborns' abilities to discriminate between languages and provide insights into how adult listeners process non-native languages. Despite the relative success of rhythm metrics in supporting the existence of linguistic rhythmic classes, quantitative studies have yet to capture the full complexity of the temporal regularities associated with speech rhythm. We argue that deep learning offers a powerful pattern-recognition approach to advance the characterization of the acoustic bases of speech rhythm. To explore this hypothesis, we trained a medium-sized recurrent neural network on a language identification task over a large database of speech recordings in 21 languages. The network had access only to the amplitude envelopes and a variable identifying the voiced segments, on the assumption that these signals would convey little phonetic information while preserving prosodic features. The network identified the language of 10-second recordings in 40% of the cases, and the correct language was among its top-3 guesses in two-thirds of the cases. Visualization methods show that representations built from the network activations are consistent with speech rhythm typologies, although the resulting maps are more complex than a two-cluster separation between stress-timed and syllable-timed languages. We further analyzed the model by identifying correlations between network activations and known speech rhythm metrics. The findings illustrate the potential of deep learning tools to advance our understanding of speech rhythm through the identification and exploration of linguistically relevant acoustic feature spaces.
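As a rough illustration of the described setup, the sketch below defines a recurrent language-identification network operating on two per-frame features: the amplitude envelope and a binary voicing indicator. The abstract does not specify the architecture beyond "medium-sized recurrent", so the GRU, layer sizes, and 100 Hz frame rate are assumptions made for the example.

```python
# Minimal sketch (PyTorch assumed; GRU, sizes, and frame rate are
# illustrative choices, not the paper's specification).
import torch
import torch.nn as nn

class RhythmLangID(nn.Module):
    def __init__(self, n_languages=21, hidden=128):
        super().__init__()
        # Two input channels per frame: amplitude envelope value and a
        # 0/1 voiced-segment indicator.
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden,
                          num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_languages)

    def forward(self, x):
        # x: (batch, time, 2) envelope + voicing features
        _, h = self.rnn(x)
        return self.head(h[-1])        # logits over the 21 languages

# Example: 10-second clips at a hypothetical 100 Hz feature frame rate.
model = RhythmLangID()
frames = torch.rand(8, 1000, 2)        # batch of 8 feature sequences
logits = model(frames)
top3 = logits.topk(3, dim=-1).indices  # "language among the top-3 guesses"
```

Training such a model with a standard cross-entropy loss over language labels would then allow probing its hidden activations, e.g., correlating them with rhythm metrics or projecting them for visualization, along the lines the abstract describes.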