Oli Danyi Liu

Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations

Jun 13, 2024

Mukhtar Mohamed, Oli Danyi Liu, Hao Tang, Sharon Goldwater

Abstract: Self-supervised speech representations can hugely benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with downstream tasks: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone ID to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlates with linear probing accuracy. We find that both measures correlate with phonetic probing accuracy, though our results on isotropy are more nuanced.

* Accepted to Interspeech
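
The two geometric properties named in the abstract above can be illustrated with a small NumPy sketch. It is not the paper's Cumulative Residual Variance (CRV) measure, whose definition is not given on this page. As stand-ins, it uses principal-angle cosines between the centroid subspaces for orthogonality and a normalized participation ratio of the covariance eigenvalues for isotropy. The function names `centroid_basis`, `principal_cosines`, and `isotropy_ratio` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def centroid_basis(X, labels, tol=1e-10):
    """Orthonormal basis (as rows) of the subspace spanned by the
    mean-centred class centroids of X (frames x dims)."""
    centroids = np.stack([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    centroids = centroids - centroids.mean(axis=0)
    # Rows of vt are orthonormal directions; keep those with non-negligible variance.
    _, s, vt = np.linalg.svd(centroids, full_matrices=False)
    return vt[s > tol * s.max()]


def principal_cosines(basis_a, basis_b):
    """Cosines of the principal angles between two subspaces.
    All values near 0 mean the subspaces are close to orthogonal."""
    return np.linalg.svd(basis_a @ basis_b.T, compute_uv=False)


def isotropy_ratio(X):
    """Stand-in isotropy proxy (NOT the paper's CRV): participation ratio of the
    covariance eigenvalues divided by the dimensionality. Equals 1 when variance
    is spread evenly over all dimensions and 1/d when it lies on a single axis."""
    eig = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    return (eig.sum() ** 2) / (np.square(eig).sum() * X.shape[1])


# Toy data (an assumption for illustration): speaker identity moves points
# in dims 0-1 only, phone identity moves them in dims 2-3 only.
speakers = np.repeat(np.arange(3), 3)
phones = np.tile(np.arange(3), 3)
offsets = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X = np.hstack([offsets[speakers], offsets[phones]])

cosines = principal_cosines(centroid_basis(X, speakers), centroid_basis(X, phones))
print("principal-angle cosines:", cosines)
print("isotropy proxy:", isotropy_ratio(X))
```

With real model features, `X` would hold frame-level representations and the label arrays would come from speaker and phone annotations. Because the toy data is built so that the two factors occupy disjoint dimensions, the cosines come out near zero by construction.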

A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech

May 13, 2024

Oli Danyi Liu, Hao Tang, Naomi Feldman, Sharon Goldwater

Abstract: Speech perception involves storing and integrating sequentially presented items. Recent work in cognitive neuroscience has identified temporal and contextual characteristics in humans' neural encoding of speech that may facilitate this temporal processing. In this study, we simulated similar analyses with representations extracted from a computational model that was trained on unlabelled speech with the learning objective of predicting upcoming acoustics. Our simulations revealed temporal dynamics similar to those in brain signals, implying that these properties can arise without linguistic knowledge. Another property shared between brains and the model is that the encoding patterns of phonemes support some degree of cross-context generalization. However, we found evidence that the effectiveness of these generalizations depends on the specific contexts, which suggests that this analysis alone is insufficient to support the presence of context-invariant encoding.

* Accepted to CogSci 2024
