Stefan Ruseti

The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

May 21, 2025

Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu

Abstract: Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.

* 1 Table, 8 Figures
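
The tokenization limitation named in the abstract can be illustrated with a toy example. The sketch below is not the authors' code or their synthetic tasks: the vocabulary `TOY_VOCAB`, the greedy longest-match segmenter, and the helper names are all hypothetical, chosen only to show that a subword model receives token IDs rather than characters.

```python
# Hypothetical toy vocabulary (an assumption for illustration only;
# real subword tokenizers learn their vocabulary from data).
TOY_VOCAB = {"straw": 0, "berry": 1, "s": 2, "t": 3, "r": 4,
             "a": 5, "w": 6, "b": 7, "e": 8, "y": 9}
ID_TO_SPELLING = {i: tok for tok, i in TOY_VOCAB.items()}


def greedy_tokenize(word, vocab=TOY_VOCAB):
    """Segment a word by repeatedly taking the longest matching vocabulary entry."""
    ids = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                ids.append(vocab[piece])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {word[start:]!r}")
    return ids


def count_letter(token_ids, letter):
    """Count a letter by looking up the spelling of every token ID."""
    return sum(ID_TO_SPELLING[i].count(letter) for i in token_ids)


ids = greedy_tokenize("strawberry")
print(ids)                     # [0, 1]: the model sees these IDs, not characters
print(count_letter(ids, "r"))  # 3: only recoverable if each ID's spelling is known
```

In this toy setting the word arrives as two opaque IDs, so counting a letter depends on knowing the spelling behind each ID, which is information the input sequence itself does not expose.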

How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics

Oct 04, 2024

Adrian Cosma, Stefan Ruseti, Mihai Dascalu, Cornelia Caragea

Abstract: Natural Language Inference (NLI) evaluation is crucial for assessing language understanding models; however, popular datasets suffer from systematic spurious correlations that artificially inflate actual model performance. To address this, we propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples. We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics. This categorization significantly reduces spurious correlation measures, with examples labeled as having the highest difficulty showing markedly decreased performance and encompassing more realistic and diverse linguistic phenomena. When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset, surpassing other dataset characterization techniques. Our research addresses limitations in NLI dataset construction, providing a more authentic evaluation of model performance with implications for diverse NLU applications.

* Accepted at EMNLP 2024 Main Conference
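
The abstract does not say which training-dynamics statistics are used, so the following is only a sketch of the general idea under a stated assumption: each example is scored by the mean probability the model assigned to its gold label across epochs, and examples are then split into three equal-sized difficulty levels. The function names and the tercile split are hypothetical, not the authors' method.

```python
def mean_confidence(gold_probs_per_epoch):
    """Average probability assigned to the gold label over training epochs."""
    return sum(gold_probs_per_epoch) / len(gold_probs_per_epoch)


def split_by_difficulty(dynamics):
    """Split examples into three difficulty levels.

    dynamics maps an example ID to the list of gold-label probabilities
    recorded after each epoch. High mean confidence is treated as "easy",
    low mean confidence as "hard" (an assumption of this sketch).
    """
    ranked = sorted(dynamics,
                    key=lambda ex: mean_confidence(dynamics[ex]),
                    reverse=True)
    n = len(ranked)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "easy": ranked[:cut1],
        "medium": ranked[cut1:cut2],
        "hard": ranked[cut2:],
    }


# Made-up probabilities for six examples over two epochs.
example_dynamics = {
    "a": [0.90, 0.95], "b": [0.20, 0.30], "c": [0.60, 0.50],
    "d": [0.10, 0.10], "e": [0.80, 0.70], "f": [0.40, 0.60],
}
print(split_by_difficulty(example_dynamics))
```

A fixed tercile split is the simplest choice here; other criteria, such as thresholds on confidence or on its variability across epochs, would fit the same interface.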

Romanian Diacritics Restoration Using Recurrent Neural Networks

Sep 06, 2020

Stefan Ruseti, Teodor-Mihai Cotet, Mihai Dascalu

Abstract: Diacritics restoration is a mandatory step for adequately processing Romanian texts, and not a trivial one, since context is generally needed to restore a character correctly. Most previous methods tested for Romanian diacritics restoration do not use neural networks. Among those that do, none is specifically optimized for this language (i.e., they were generally designed to work on many different languages). Therefore, we propose a novel neural architecture based on recurrent neural networks that can attend to information at different levels of abstraction in order to restore diacritics.

* 2 pages, 1 figure

Answering questions by learning to rank -- Learning to rank by answering questions

Sep 02, 2019

George-Sebastian Pîrtoacă, Traian Rebedea, Stefan Ruseti

Abstract: Answering multiple-choice questions in a setting in which no supporting documents are explicitly provided continues to stand as a core problem in natural language processing. The contribution of this article is two-fold. First, we describe a method that can be used to semantically rank documents extracted from Wikipedia or similar natural language corpora. Second, we propose a model employing the semantic ranking that holds first place in two of the most popular leaderboards for answering multiple-choice questions: ARC Easy and Challenge. To achieve this, we introduce a self-attention based neural network that latently learns to rank documents by their importance to a given question, whilst optimizing the objective of predicting the correct answer. These documents are considered relevant contexts for the underlying question. We have published the ranked documents so that they can be used off-the-shelf to improve downstream decision models.

* Accepted at EMNLP 2019; 10 pages, 5 figures

Improving Retrieval-Based Question Answering with Deep Inference Models

Dec 07, 2018

George-Sebastian Pirtoaca, Traian Rebedea, Stefan Ruseti

Abstract: Question answering is one of the most important and difficult applications at the border of information retrieval and natural language processing, especially for complex science questions that require some form of inference to determine the correct answer. In this paper, we present a two-step method that combines information retrieval techniques optimized for question answering with deep learning models for natural language inference in order to tackle multiple-choice question answering in the science domain. For each question-answer pair, we use standard retrieval-based models to find relevant candidate contexts and decompose the main problem into two different sub-problems. First, we assign correctness scores to each candidate answer based on the context using retrieval models from Lucene. Second, we use deep learning architectures to compute whether a candidate answer can be inferred from a well-chosen context consisting of sentences retrieved from the knowledge base. In the end, all these solvers are combined using a simple neural network to predict the correct answer. The proposed two-step model outperforms the best retrieval-based solver by over 3% in absolute accuracy.
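
The final combination step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's model: the abstract only says that a simple neural network combines the solvers, so the fixed weighted sum, the weights, the function names, and the scores below are all assumptions standing in for a learned combiner.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def combine_solvers(retrieval_scores, inference_scores,
                    w_retrieval=0.5, w_inference=0.5):
    """Combine per-candidate scores from two solvers and pick an answer.

    A fixed weighted sum stands in for the learned combiner (an assumption
    of this sketch); in practice the weights would be trained.
    """
    combined = [w_retrieval * r + w_inference * e
                for r, e in zip(retrieval_scores, inference_scores)]
    probs = softmax(combined)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Made-up scores for four candidate answers (A-D).
retrieval = [0.2, 0.9, 0.1, 0.3]    # e.g. from a retrieval-based solver
inference = [0.1, 0.3, 0.95, 0.2]   # e.g. from an inference-based solver
best, probs = combine_solvers(retrieval, inference,
                              w_retrieval=0.3, w_inference=0.7)
print("ABCD"[best], [round(p, 3) for p in probs])
```

With these illustrative weights the inference solver dominates and candidate C is chosen; weighting retrieval alone would pick B, which shows why the combination has to be tuned rather than fixed by hand.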
