Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shane Steinert-Threlkeld

Minimization of Boolean Complexity in In-Context Concept Learning

Dec 03, 2024

Leroy Z. Wang, R. Thomas McCoy, Shane Steinert-Threlkeld

Abstract:What factors contribute to the relative success and corresponding difficulties of in-context learning for Large Language Models (LLMs)? Drawing on insights from the literature on human concept learning, we test LLMs on carefully designed concept learning tasks, and show that task performance highly correlates with the Boolean complexity of the concept. This suggests that in-context learning exhibits a learning bias for simplicity in a way similar to humans.

Via

Access Paper or Ask Questions

Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence

May 24, 2024

Abhinav Patil, Jaap Jumelet, Yu Ying Chiu, Andy Lapastora, Peter Shen, Lexie Wang, Clevis Willrich, Shane Steinert-Threlkeld

Abstract:This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.

* 10 pages + 7 pages of references/appendices. For code and trained models, see http://github.com/CLMBRs/corpus-filtering

Via

Access Paper or Ask Questions

Targeted Multilingual Adaptation for Low-resource Language Families

May 20, 2024

C. M. Downey, Terra Blevins, Dhwani Serai, Dwija Parikh, Shane Steinert-Threlkeld

Abstract:The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for low-resource languages, and that low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages. These results introduce new best practices for performing language adaptation in a targeted setting.

Via

Access Paper or Ask Questions

The Impact of Syntactic and Semantic Proximity on Machine Translation with Back-Translation

Mar 26, 2024

Nicolas Guerin, Shane Steinert-Threlkeld, Emmanuel Chemla

Abstract:Unsupervised on-the-fly back-translation, in conjunction with multilingual pretraining, is the dominant method for unsupervised neural machine translation. Theoretically, however, the method should not work in general. We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical, syntactic, and semantic properties. We find, contrary to popular belief, that (i) parallel word frequency distributions, (ii) partially shared vocabulary, and (iii) similar syntactic structure across languages are not sufficient to explain the success of back-translation. We show however that even crude semantic signal (similar lexical fields across languages) does improve alignment of two languages through back-translation. We conjecture that rich semantic dependencies, parallel across languages, are at the root of the success of unsupervised methods based on back-translation. Overall, the success of unsupervised machine translation was far from being analytically guaranteed. Instead, it is another proof that languages of the world share deep similarities, and we hope to show how to identify which of these similarities can serve the development of unsupervised, cross-linguistic tools.

Via

Access Paper or Ask Questions

Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages

Sep 09, 2023

C. M. Downey, Terra Blevins, Nora Goldfine, Shane Steinert-Threlkeld

Abstract:Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English. A strong baseline for specializing these models for specific languages is Language-Adaptive Pre-Training (LAPT). However, retaining a large cross-lingual vocabulary and embedding matrix comes at considerable excess computational cost during adaptation. In this study, we propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one. Namely, we address strategies for re-initializing the token embedding matrix after vocabulary specialization. We then provide a systematic experimental comparison of our techniques, in addition to the recently-proposed Focus method. We demonstrate that: 1) Embedding-replacement techniques in the monolingual transfer literature are inadequate for adapting multilingual models. 2) Replacing cross-lingual vocabularies with smaller specialized ones provides an efficient method to improve performance in low-resource languages. 3) Simple embedding re-initialization techniques based on script-wise sub-distributions rival techniques such as Focus, which rely on similarity scores obtained from an auxiliary model.

Via

Access Paper or Ask Questions

Evaluating Transformer's Ability to Learn Mildly Context-Sensitive Languages

Sep 02, 2023

Shunjie Wang, Shane Steinert-Threlkeld

Abstract:Despite that Transformers perform well in NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to think about their implications in modeling natural language, which is hypothesized to be mildly context-sensitive. We test Transformer's ability to learn a variety of mildly context-sensitive languages of varying complexities, and find that they generalize well to unseen in-distribution data, but their ability to extrapolate to longer strings is worse than that of LSTMs. Our analyses show that the learned self-attention patterns and representations modeled dependency relations and demonstrated counting behavior, which may have helped the models solve the languages.

Via

Access Paper or Ask Questions

The Weighted Möbius Score: A Unified Framework for Feature Attribution

May 16, 2023

Yifan Jiang, Shane Steinert-Threlkeld

Abstract:Feature attribution aims to explain the reasoning behind a black-box model's prediction by identifying the impact of each feature on the prediction. Recent work has extended feature attribution to interactions between multiple features. However, the lack of a unified framework has led to a proliferation of methods that are often not directly comparable. This paper introduces a parameterized attribution framework -- the Weighted M\"obius Score -- and (i) shows that many different attribution methods for both individual features and feature interactions are special cases and (ii) identifies some new methods. By studying the vector space of attribution methods, our framework utilizes standard linear algebra tools and provides interpretations in various fields, including cooperative game theory and causal mediation analysis. We empirically demonstrate the framework's versatility and effectiveness by applying these attribution methods to feature interactions in sentiment analysis and chain-of-thought prompting.

Via

Access Paper or Ask Questions

Probing for Understanding of English Verb Classes and Alternations in Large Pre-trained Language Models

Sep 11, 2022

David K. Yi, James V. Bruno, Jiayu Han, Peter Zukerman, Shane Steinert-Threlkeld

Figure 1 for Probing for Understanding of English Verb Classes and Alternations in Large Pre-trained Language Models

Figure 2 for Probing for Understanding of English Verb Classes and Alternations in Large Pre-trained Language Models

Figure 3 for Probing for Understanding of English Verb Classes and Alternations in Large Pre-trained Language Models

Figure 4 for Probing for Understanding of English Verb Classes and Alternations in Large Pre-trained Language Models

Abstract:We investigate the extent to which verb alternation classes, as described by Levin (1993), are encoded in the embeddings of Large Pre-trained Language Models (PLMs) such as BERT, RoBERTa, ELECTRA, and DeBERTa using selectively constructed diagnostic classifiers for word and sentence-level prediction tasks. We follow and expand upon the experiments of Kann et al. (2019), which aim to probe whether static embeddings encode frame-selectional properties of verbs. At both the word and sentence level, we find that contextual embeddings from PLMs not only outperform non-contextual embeddings, but achieve astonishingly high accuracies on tasks across most alternation classes. Additionally, we find evidence that the middle-to-upper layers of PLMs achieve better performance on average than the lower layers across all probing tasks.

* 8 pages, 6 figures

Via

Access Paper or Ask Questions

Testing Pre-trained Language Models' Understanding of Distributivity via Causal Mediation Analysis

Sep 11, 2022

Pangbo Ban, Yifan Jiang, Tianran Liu, Shane Steinert-Threlkeld

Figure 1 for Testing Pre-trained Language Models' Understanding of Distributivity via Causal Mediation Analysis

Figure 2 for Testing Pre-trained Language Models' Understanding of Distributivity via Causal Mediation Analysis

Figure 3 for Testing Pre-trained Language Models' Understanding of Distributivity via Causal Mediation Analysis

Figure 4 for Testing Pre-trained Language Models' Understanding of Distributivity via Causal Mediation Analysis

Abstract:To what extent do pre-trained language models grasp semantic knowledge regarding the phenomenon of distributivity? In this paper, we introduce DistNLI, a new diagnostic dataset for natural language inference that targets the semantic difference arising from distributivity, and employ the causal mediation analysis framework to quantify the model behavior and explore the underlying mechanism in this semantically-related task. We find that the extent of models' understanding is associated with model size and vocabulary size. We also provide insights into how models encode such high-level semantic knowledge.

Via

Access Paper or Ask Questions

Learning to translate by learning to communicate

Jul 14, 2022

C. M. Downey, Leo Z. Liu, Xuhui Zhou, Shane Steinert-Threlkeld

Figure 1 for Learning to translate by learning to communicate

Figure 2 for Learning to translate by learning to communicate

Abstract:We formulate and test a technique to use Emergent Communication (EC) with a pretrained multilingual model to improve on modern Unsupervised NMT systems, especially for low-resource languages. It has been argued that the currently dominant paradigm in NLP of pretraining on text-only corpora will not yield robust natural language understanding systems, and the need for grounded, goal-oriented, and interactive language learning has been highlighted. In our approach, we embed a modern multilingual model (mBART, Liu et. al. 2020) into an EC image-reference game, in which the model is incentivized to use multilingual generations to accomplish a vision-grounded task, with the hypothesis that this will align multiple languages to a shared task space. We present two variants of EC Fine-Tuning (Steinert-Threlkeld et. al. 2022), one of which outperforms a backtranslation-based baseline in 6/8 translation settings, and proves especially beneficial for the very low-resource languages of Nepali and Sinhala.

Via

Access Paper or Ask Questions