Pretraining language models on formal languages can improve their acquisition of natural language, but it is unclear which features of the formal language impart an inductive bias that leads to effective transfer. Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when the formal language both captures the dependency structures found in natural language and remains within the computational limitations of the model architecture. Focusing on transformers, we find that formal languages with both these properties enable language models to achieve lower loss on natural language and better linguistic generalization than formal languages lacking either property. In fact, pre-pretraining, or training on formal and then natural language, reduces loss more efficiently than training on the same amount of natural language alone. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss, and better linguistic generalization, with a 33% smaller token budget. We also give mechanistic evidence of cross-task transfer from formal to natural language: attention heads acquired during formal language pretraining remain crucial for the model's performance on syntactic evaluations.