Nicolas Ballier

UPCité, ALTAE

The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR

Oct 26, 2025

Siyu Liang, Nicolas Ballier, Gina-Anne Levow, Richard Wright

Abstract: How much audio is needed to fully observe a multilingual ASR model's learned sub-token inventory across languages, and does data disparity in multilingual pre-training affect how these tokens are utilized during inference? We address this question by analyzing Whisper's decoding behavior during inference across 49 languages. By logging decoding candidate sub-tokens and tracking their cumulative discovery over time, we study the utilization pattern of the model's sub-token space. Results show that the total number of discovered tokens remains largely independent of a language's pre-training hours, indicating that data disparity does not strongly influence lexical diversity in the model's hypothesis space. Sub-token discovery rates follow a consistent exponential saturation pattern across languages, suggesting a stable time window after which additional audio yields minimal new sub-token activation. We refer to this convergence threshold as acoustic saturation time (AST). Further analyses of rank-frequency distributions reveal Zipf-like patterns better modeled by a Zipf-Mandelbrot law, and mean sub-token length shows a positive correlation with resource level. Additionally, these metrics show more favorable patterns for languages in the Latin script than for those in scripts such as Cyrillic, CJK, and Semitic. Together, our study suggests that sub-token utilization during multilingual ASR inference is constrained more by the statistical, typological, and orthographic structure of the speech than by training data scale, providing an empirical basis for more equitable corpus construction and cross-lingual evaluation.
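
To make the idea of cumulative sub-token discovery concrete, here is a minimal Python sketch. It is not the authors' code: the function names, the 95% threshold, and the form N(t) = n_max * (1 - exp(-t / tau)) are assumptions chosen for illustration, and the abstract does not give the paper's exact definition of AST.

```python
import math

def cumulative_discovery(tokens):
    """Running count of distinct sub-tokens seen so far in a decoded stream."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

def saturation_index(curve, fraction=0.95):
    """First position where the curve reaches `fraction` of its final value.

    The 0.95 default is a hypothetical threshold, not the paper's AST criterion.
    """
    target = fraction * curve[-1]
    for i, n in enumerate(curve):
        if n >= target:
            return i
    return len(curve) - 1

def saturating_model(t, n_max, tau):
    """Assumed exponential saturation curve: n_max * (1 - exp(-t / tau))."""
    return n_max * (1.0 - math.exp(-t / tau))

# Toy stream of sub-tokens (made-up data, for illustration only).
curve = cumulative_discovery(["a", "b", "a", "c", "b", "a"])
print(curve, saturation_index(curve))
```

Under the assumed model, the time needed to reach a fraction p of the inventory is -tau * ln(1 - p), so a smaller tau means earlier saturation.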

Assessing the validity of new paradigmatic complexity measures as criterial features for proficiency in L2 writings in English

Mar 13, 2025

Cyriel Mallart, Andrew Simpkin, Nicolas Ballier, Paula Lissón, Rémi Venant, Jen-Yu Li, Bernardo Stearns, Thomas Gaillat

Abstract: This article addresses Second Language (L2) writing development through an investigation of new grammatical and structural complexity metrics. We explore paradigmatic production in learner English by linking language functions to specific grammatical paradigms. Using EFCAMDAT as a gold standard and a corpus of French learners as an external test set, we employ a supervised learning framework to operationalise and evaluate seven microsystems (MS). We show that learner levels are associated with these seven microsystems. Evaluated with ordinal regression modelling, all MS are significant but have little impact when taken individually; taken as a group, however, their influence is substantial. These microsystems and their measurement method suggest that they could be used as part of broader-purpose CALL systems focused on proficiency assessment.

* Language Learning, 2024

Fine-tuning a Subtle Parsing Distinction Using a Probabilistic Decision Tree: the Case of Postnominal "that" in Noun Complement Clauses vs. Relative Clauses

Dec 05, 2022

Zineddine Tighidet, Nicolas Ballier

Abstract: In this paper we investigated two different methods to parse relative and noun complement clauses in English, using distinct tags for "that" as a relative pronoun and as a complementizer. We used an algorithm to relabel a corpus parsed with the GUM Treebank using Universal Dependencies. Our second experiment consisted of using TreeTagger, a probabilistic decision tree tagger, to learn the distinction between the complement and relative uses of postnominal "that". We investigated the effect of training set size on TreeTagger accuracy and how representative the GUM Treebank files are of the two structures under scrutiny. We discussed some of the linguistic and structural tenets of the learnability of this distinction.

* Published in the ACL anthology, ALTA 2022

Screening Gender Transfer in Neural Machine Translation

Feb 25, 2022

Guillaume Wisniewski, Lichao Zhu, Nicolas Ballier, François Yvon

Abstract: This paper aims at identifying the information flow in state-of-the-art machine translation systems, taking as an example the transfer of gender when translating from French into English. Using a controlled set of examples, we experiment with several ways to investigate how gender information circulates in an encoder-decoder architecture, considering both probing techniques and interventions on the internal representations used in the MT system. Our results show that gender information can be found in all token representations built by the encoder and the decoder, and lead us to conclude that there are multiple pathways for gender transfer.

* Accepted at BlackBoxNLP'2021

Approches quantitatives de l'analyse des prédictions en traduction automatique neuronale

Dec 10, 2020

Maria Zimina-Poirot, Nicolas Ballier, Jean-Baptiste Yunès

Abstract: As part of a larger project on optimal learning conditions in neural machine translation, we investigate characteristic training phases of translation engines. All our experiments are carried out using OpenNMT-Py: the pre-processing step is implemented using the Europarl training corpus and the INTERSECT corpus is used for validation. Longitudinal analyses of training phases suggest that the progression of translations is not always linear. Following the results of textometric explorations, we identify the importance of the phenomena related to chronological progression, in order to map different processes at work in neural machine translation (NMT).

* In French. JADT 2020: 15èmes Journées Internationales d'Analyse statistique des Données Textuelles, Université de Toulouse, Jun 2020, Toulouse, France

Predicting CEFRL levels in learner English on the basis of metrics and full texts

Jun 28, 2018

Taylor Arnold, Nicolas Ballier, Thomas Gaillat, Paula Lissón

Abstract: This paper analyses the contribution of language metrics and, potentially, of linguistic structures, to the classification of French learners of English according to levels of the Common European Framework of Reference for Languages (CEFRL). The purpose is to build a model that predicts learner levels as a function of language complexity features. We used the EFCAMDAT corpus, a database of one million written assignments by learners. After applying language complexity metrics to the texts, we built a representation matching the language metrics of the texts to their assigned CEFRL levels. Lexical and syntactic metrics were computed with LCA, LSA, and koRpus. Several supervised learning models were built using Gradient Boosted Trees and Keras Neural Network methods and by contrasting pairs of CEFRL levels. Results show that it is possible to implement pairwise distinctions, especially for levels ranging from A1 to B1 (A1=>A2: 0.916 AUC and A2=>B1: 0.904 AUC). Model explanation reveals linguistic features that contribute significantly to prediction in the corpus. Word tokens and word types appear to play a significant role in determining levels. This shows that levels are highly dependent on specific semantic profiles.

* Conference paper presented at Conférence sur l'Apprentissage Automatique (CAp) 2018
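
For readers unfamiliar with the AUC figures quoted for the pairwise level distinctions, the following Python sketch shows one standard way to read the metric: the share of (higher-level, lower-level) text pairs that a classifier scores in the right order. The function name and the scores are hypothetical and are not taken from the paper, whose models were built with Gradient Boosted Trees and Keras.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier scores for A2 texts (positives) and A1 texts (negatives).
a2_scores = [0.9, 0.4]
a1_scores = [0.5, 0.1]
print(pairwise_auc(a2_scores, a1_scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two levels.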
