Yuekun Yao

Language models can learn implicit multi-hop reasoning, but only if they have lots of training data

May 23, 2025

Yuekun Yao, Yupei Du, Dawei Zhu, Michael Hahn, Alexander Koller

Abstract:Implicit reasoning is the ability of a language model to solve multi-hop reasoning tasks in a single forward pass, without chain of thought. We investigate this capability using GPT2-style language models trained from scratch on controlled $k$-hop reasoning datasets ($k = 2, 3, 4$). We show that while such models can indeed learn implicit $k$-hop reasoning, the required training data grows exponentially in $k$, and the required number of transformer layers grows linearly in $k$. We offer a theoretical explanation for why this depth growth is necessary. We further find that the data requirement can be mitigated, but not eliminated, through curriculum learning.
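To make "$k$-hop" concrete, here is a minimal sketch that treats each hop as a lookup table over a small entity set. The table layout, the entity names, and the helper names `make_hop_tables` and `answer` are illustrative assumptions, not the paper's actual dataset format.

```python
import random


def make_hop_tables(entities, k, seed=0):
    """Build k random lookup tables, one per hop (hypothetical layout)."""
    rng = random.Random(seed)
    return [{e: rng.choice(entities) for e in entities} for _ in range(k)]


def answer(tables, start):
    """Follow one lookup per hop and return the final entity.

    An implicit reasoner has to produce this final entity directly,
    without writing out the intermediate entities.
    """
    current = start
    for table in tables:
        current = table[current]
    return current


entities = [f"e{i}" for i in range(8)]
tables = make_hop_tables(entities, k=3)
result = answer(tables, "e0")
```

In this toy setup, a 3-hop query composes three lookups, so the answer depends on one fact from each table.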

Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs

Feb 26, 2025

Xiulin Yang, Tatsuya Aoyama, Yuekun Yao, Ethan Wilcox

Abstract:Do LLMs offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm are so vastly different from humans, LLMs can learn arbitrary inputs as easily as natural languages. In this paper, we test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 natural languages from 4 language families. Our results show that while GPT-2 small can primarily distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word order based on Greenberg's Universal 20. We find that the model's perplexity scores do not distinguish attested vs. unattested word orders, as long as the unattested variants maintain constituency structure. These findings suggest that language models exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.

Simple and effective data augmentation for compositional generalization

Jan 18, 2024

Yuekun Yao, Alexander Koller

Abstract:Compositional generalization, the ability to predict complex meanings from training on simpler sentences, poses challenges for powerful pretrained seq2seq models. In this paper, we show that data augmentation methods that sample MRs and backtranslate them can be effective for compositional generalization, but only if we sample from the right distribution. Remarkably, sampling from a uniform distribution performs almost as well as sampling from the test distribution, and greatly outperforms earlier methods that sampled from the training distribution. We further conduct experiments to investigate why this happens and where the benefit of such data augmentation methods comes from.

Predicting generalization performance with correctness discriminators

Nov 15, 2023

Yuekun Yao, Alexander Koller

Abstract:The ability to predict an NLP model's accuracy on unseen, potentially out-of-distribution data is a prerequisite for trustworthiness. We present a novel model that establishes upper and lower bounds on the accuracy, without requiring gold labels for the unseen data. We achieve this by training a discriminator which predicts whether the output of a given sequence-to-sequence model is correct or not. We show across a variety of tagging, parsing, and semantic parsing tasks that the gold accuracy is reliably between the predicted upper and lower bounds, and that these bounds are remarkably close together.

SLOG: A Structural Generalization Benchmark for Semantic Parsing

Oct 23, 2023

Bingzhi Li, Lucia Donatelli, Alexander Koller, Tal Linzen, Yuekun Yao, Najoung Kim

Abstract:The goal of compositional generalization benchmarks is to evaluate how well models generalize to new complex linguistic expressions. Existing benchmarks often focus on lexical generalization, the interpretation of novel lexical items in syntactic structures familiar from training; structural generalization tasks, where a model needs to interpret syntactic structures that are themselves unfamiliar from training, are often underrepresented, resulting in overly optimistic perceptions of how well models can generalize. We introduce SLOG, a semantic parsing dataset that extends COGS (Kim and Linzen, 2020) with 17 structural generalization cases. In our experiments, the generalization accuracy of Transformer models, including pretrained ones, only reaches 40.6%, while a structure-aware parser only achieves 70.8%. These results are far from the near-perfect accuracy existing models achieve on COGS, demonstrating the role of SLOG in foregrounding the large discrepancy between models' lexical and structural generalization capacities.

* Accepted to EMNLP 2023

Structural generalization is hard for sequence-to-sequence models

Oct 24, 2022

Yuekun Yao, Alexander Koller

Abstract:Sequence-to-sequence (seq2seq) models have been successful across many NLP tasks, including ones that require predicting linguistic structure. However, recent work on compositional generalization has shown that seq2seq models achieve very low accuracy in generalizing to linguistic structures that were not seen in training. We present new evidence that this is a general limitation of seq2seq models that is present not just in semantic parsing, but also in syntactic parsing and in text-to-text tasks, and that this limitation can often be overcome by neurosymbolic models that have linguistic knowledge built in. We further report on some experiments that give initial answers on the reasons for these limitations.

* Accepted to EMNLP 2022

Compositional Generalization Requires Compositional Parsers

Feb 24, 2022

Pia Weißenhorn, Yuekun Yao, Lucia Donatelli, Alexander Koller

Abstract:A rapidly growing body of research on compositional generalization investigates the ability of a semantic parser to dynamically recombine linguistic elements seen in training into unseen sequences. We present a systematic comparison of sequence-to-sequence models and models guided by compositional principles on the recent COGS corpus (Kim and Linzen, 2020). Though seq2seq models can perform well on lexical tasks, they perform with near-zero accuracy on structural generalization tasks that require novel syntactic structures; this holds true even when they are trained to predict syntax instead of semantics. In contrast, compositional models achieve near-perfect accuracy on structural generalization; we present new results confirming this from the AM parser (Groschwitz et al., 2021). Our findings show structural generalization is a key measure of compositional generalization and requires models that are aware of complex structure.

ELITR Non-Native Speech Translation at IWSLT 2020

Jun 05, 2020

Dominik Macháček, Jonáš Kratochvíl, Sangeet Sagar, Matúš Žilinec, Ondřej Bojar, Thai-Son Nguyen, Felix Schneider, Philip Williams, Yuekun Yao

Abstract:This paper is an ELITR system submission for the non-native speech translation task at IWSLT 2020. We describe systems for offline ASR, real-time ASR, and our cascaded approach to offline SLT and real-time SLT. We select our primary candidates from a pool of pre-existing systems, develop a new end-to-end general ASR system, and a hybrid ASR trained on non-native speech. The provided small validation set prevents us from carrying out a complex validation, but we submit all the unselected candidates for contrastive evaluation on the test set.

* IWSLT 2020

Dynamic Masking for Improved Stability in Spoken Language Translation

May 30, 2020

Yuekun Yao, Barry Haddow

Abstract:For spoken language translation (SLT) in live scenarios such as conferences, lectures and meetings, it is desirable to show the translation to the user as quickly as possible, avoiding an annoying lag between speaker and translated captions. In other words, we would like low-latency, online SLT. If we assume a pipeline of automatic speech recognition (ASR) and machine translation (MT), then a viable approach to online SLT is to pair an online ASR system with a retranslation strategy, where the MT system re-translates every update received from ASR. However, this can result in annoying "flicker" as the MT system updates its translation. A possible solution is to add a fixed delay, or "mask", to the output of the MT system, but a fixed global mask introduces undesirable latency to the output. We show how this mask can be set dynamically, improving the latency-flicker trade-off without sacrificing translation quality.
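The fixed-mask baseline mentioned in the abstract can be sketched as follows, assuming the mask hides the last `k` tokens of each MT hypothesis and that flicker is approximated by counting displayed tokens that later have to be erased. The function names and the toy updates are hypothetical, and the paper's dynamic mask is not reproduced here.

```python
def apply_fixed_mask(tokens, k):
    """Hide the last k tokens of an MT hypothesis before display."""
    return tokens[:-k] if k > 0 else list(tokens)


def erased_tokens(previous, current):
    """Count displayed tokens that must be erased when the output updates."""
    common = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        common += 1
    return len(previous) - common


def total_erasure(updates, k):
    """Sum erasures over a stream of retranslated hypotheses with mask k."""
    shown, erased = [], 0
    for hypothesis in updates:
        masked = apply_fixed_mask(hypothesis, k)
        erased += erased_tokens(shown, masked)
        shown = masked
    return erased


# Toy stream of retranslations as the ASR prefix grows (made-up data).
updates = [
    "we want".split(),
    "we would like".split(),
    "we would like low latency".split(),
]
```

On this toy stream a mask of one token removes the single erasure seen with no mask, at the cost of always displaying one token fewer, which is the latency-flicker trade-off the abstract refers to.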
