Aydar Bulatov

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Mar 14, 2026

Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev

Abstract: Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
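
The write operation described in this abstract (a few gradient steps on memory while the model weights stay frozen) can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a hypothetical frozen linear map, the memory is a single small vector rather than prefix tokens, and the names `reconstruct`, `reconstruction_loss`, and `write_to_memory` are made up for illustration.

```python
import random


def reconstruct(weights, memory):
    """Frozen toy 'model' (an assumption): linear map from memory to a reconstructed context."""
    return [sum(w * m for w, m in zip(row, memory)) for row in weights]


def reconstruction_loss(weights, memory, context):
    """Half the squared error between the reconstruction and the context."""
    pred = reconstruct(weights, memory)
    return 0.5 * sum((p - c) ** 2 for p, c in zip(pred, context))


def write_to_memory(weights, context, memory_size, steps=50, lr=0.05):
    """Run gradient descent on the memory vector only; `weights` is never modified."""
    memory = [0.0] * memory_size
    for _ in range(steps):
        residual = [p - c for p, c in zip(reconstruct(weights, memory), context)]
        # Gradient of the loss with respect to memory: W^T (W m - x).
        grad = [
            sum(weights[i][j] * residual[i] for i in range(len(context)))
            for j in range(memory_size)
        ]
        memory = [m - lr * g for m, g in zip(memory, grad)]
    return memory


rng = random.Random(0)
context_len, memory_size = 8, 4
# Entries in [-1, 1] keep the largest eigenvalue of W^T W at or below 32,
# so lr = 0.05 is a stable step size for this toy problem.
frozen_weights = [[rng.uniform(-1, 1) for _ in range(memory_size)] for _ in range(context_len)]
context = [rng.uniform(-1, 1) for _ in range(context_len)]

loss_before = reconstruction_loss(frozen_weights, [0.0] * memory_size, context)
memory = write_to_memory(frozen_weights, context, memory_size)
loss_after = reconstruction_loss(frozen_weights, memory, context)
```

Each extra step applies another correction driven by the current reconstruction error, which is the contrast with a single forward-only write that the abstract draws.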

Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

Feb 18, 2025

Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev

Abstract: A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or key-value cache. These approaches make it possible to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a gap of two orders of magnitude between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.
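
The capacity argument in this abstract can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes every bit of the vector is usable and that each token costs log2(vocabulary size) bits, and the vector size and vocabulary size are hypothetical round numbers. The figure of 1568 tokens in one vector comes from the paper's title.

```python
import math


def capacity_in_tokens(vector_dim, bits_per_value, vocab_size):
    """Idealized upper bound on how many tokens one vector could encode."""
    total_bits = vector_dim * bits_per_value
    bits_per_token = math.log2(vocab_size)
    return total_bits / bits_per_token


def compression_ratio(num_tokens, num_vectors):
    """Tokens represented per vector."""
    return num_tokens / num_vectors


# Hypothetical setup: a 4096-dimensional vector at 16-bit precision
# and a vocabulary of 65536 tokens (16 bits per token).
ideal_tokens = capacity_in_tokens(vector_dim=4096, bits_per_value=16, vocab_size=65536)

# The title's figure: 1568 tokens stored in a single vector.
title_ratio = compression_ratio(num_tokens=1568, num_vectors=1)
```

Under these assumptions the idealized bound is well above both the x10 ratio of encoder-based methods and the ratios reached by per-sample optimization, which is the gap the abstract points to.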

Long Input Benchmark for Russian Analysis

Aug 05, 2024

Igor Churin, Murat Apishev, Maria Tikhonova, Denis Shevelev, Aydar Bulatov, Yuri Kuratov, Sergej Averkiev, Alena Fenogenova

Abstract: Recent advancements in Natural Language Processing (NLP) have fostered the development of Large Language Models (LLMs) that can solve an immense variety of tasks. One of the key aspects of their application is their ability to work with long text documents and to process long sequences of tokens. This has created a demand for proper evaluation of long-context understanding. To address this need for the Russian language, we propose LIBRA (Long Input Benchmark for Russian Analysis), which comprises 21 adapted datasets for studying how thoroughly LLMs understand long texts. The tests are divided into four complexity groups and allow the evaluation of models across various context lengths ranging from 4k up to 128k tokens. We provide the open-source datasets, codebase, and public leaderboard for LIBRA to guide forthcoming research.

Associative Recurrent Memory Transformer

Jul 05, 2024

Ivan Rodkin, Yuri Kuratov, Aydar Bulatov, Mikhail Burtsev

Abstract: This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task-specific information distributed over a long context. We demonstrate that ARMT outperforms existing alternatives in associative retrieval tasks and sets a new performance record in the recent BABILong multi-task long-context benchmark by answering single-fact questions over 50 million tokens with an accuracy of 79.9%. The source code for training and evaluation is available on GitHub.

* ICML 2024 Next Generation of Sequence Modeling Architectures Workshop

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Jun 14, 2024

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

Abstract: In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.
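
The idea of scattering task facts across long natural text can be sketched as follows. This is an illustrative construction, not the benchmark's actual generation code: the function name `build_haystack_sample`, the filler sentences, and the example facts and question are all hypothetical. The one property the sketch preserves is that facts keep their original order, since tasks such as fact chaining depend on it.

```python
import random


def build_haystack_sample(facts, filler_sentences, total_sentences, seed=0):
    """Place `facts`, in order, at random positions among filler sentences."""
    if total_sentences < len(facts):
        raise ValueError("total_sentences must be at least the number of facts")
    rng = random.Random(seed)
    # Sorted positions guarantee that the facts appear in their original order.
    positions = sorted(rng.sample(range(total_sentences), len(facts)))
    position_set = set(positions)
    remaining_facts = iter(facts)
    sentences = []
    for index in range(total_sentences):
        if index in position_set:
            sentences.append(next(remaining_facts))
        else:
            sentences.append(rng.choice(filler_sentences))
    return " ".join(sentences), positions


# Hypothetical bAbI-style facts and unrelated filler text.
facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
filler = [
    "The weather was mild that year.",
    "Ships arrived at the harbor every week.",
    "The town council met on Tuesdays.",
]
question = "Where is the apple?"
haystack, fact_positions = build_haystack_sample(facts, filler, total_sentences=40)
```

Raising `total_sentences` lengthens the haystack without changing the task, which mirrors how the abstract describes the benchmark as extendable to any length.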

In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

Feb 21, 2024

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

Abstract: This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for sequences up to $10^4$ elements. In contrast, fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to $11\times 10^6$ elements. This achievement marks a substantial leap, as it is by far the longest input processed by any neural network model to date, demonstrating a significant improvement in the processing capabilities for long sequences.

* 11M tokens, fix qa3 min facts per task in Table 1

Better Together: Enhancing Generative Knowledge Graph Completion with Language Models and Neighborhood Information

Nov 02, 2023

Alla Chepurova, Aydar Bulatov, Yuri Kuratov, Mikhail Burtsev

Abstract: Real-world Knowledge Graphs (KGs) often suffer from incompleteness, which limits their potential performance. Knowledge Graph Completion (KGC) techniques aim to address this issue. However, traditional KGC methods are computationally intensive and impractical for large-scale KGs, necessitating the learning of dense node embeddings and computing pairwise distances. Generative transformer-based language models (e.g., T5 and recent KGT5) offer a promising solution as they can predict the tail nodes directly. In this study, we propose to include node neighborhoods as additional information to improve KGC methods based on language models. We examine the effects of this imputation and show that, on both inductive and transductive Wikidata subsets, our method outperforms KGT5 and conventional KGC approaches. We also provide an extensive analysis of the impact of neighborhood on model prediction and show its importance. Furthermore, we point the way to significantly improve KGC through more effective neighborhood selection.

* Accepted to Findings of the Association for Computational Linguistics: EMNLP 2023

Scaling Transformer to 1M tokens and beyond with RMT

Apr 19, 2023

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Abstract: This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and global information and enables information flow between segments of the input sequence through the use of recurrence. Our experiments demonstrate the effectiveness of our approach, which holds significant potential to enhance long-term dependency handling in natural language understanding and generation tasks as well as enable large-scale context processing for memory-intensive applications.

Recurrent Memory Transformer

Jul 14, 2022

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Abstract: Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention makes it possible to combine information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer). Memory makes it possible to store and process local and global information as well as to pass information between segments of the long sequence with the help of recurrence. We implement a memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The Transformer is then trained to control both memory operations and sequence representations processing. Results of experiments show that our model performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it for tasks that require longer sequence processing. We show that adding memory tokens to Tr-XL is able to improve its performance. This makes Recurrent Memory Transformer a promising architecture for applications that require learning of long-term dependencies and general-purpose in-memory processing, such as algorithmic tasks and reasoning.
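
The segment-level recurrence described in this abstract can be sketched as a loop that carries memory from one segment to the next. This is only a control-flow illustration, not the paper's model: `toy_segment_model` is a hypothetical stand-in for a trained Transformer call, and its "memory write" (remembering the largest value seen) is chosen purely so the result is easy to check.

```python
def split_into_segments(tokens, segment_len):
    """Cut a long sequence into consecutive segments of at most `segment_len` tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def toy_segment_model(memory, segment):
    """Stand-in (an assumption) for one Transformer pass over [memory tokens + segment]."""
    model_input = memory + segment  # memory tokens are prepended to the segment
    # Toy write operation: remember the largest value seen so far.
    return [max(model_input)]


def run_with_recurrent_memory(tokens, segment_len, initial_memory):
    """Process segments one at a time, passing the updated memory forward."""
    memory = list(initial_memory)
    memory_history = []
    for segment in split_into_segments(tokens, segment_len):
        # Each call sees only the current segment plus the carried memory,
        # so the per-call input size stays fixed as the sequence grows.
        memory = toy_segment_model(memory, segment)
        memory_history.append(memory)
    return memory, memory_history


final_memory, history = run_with_recurrent_memory(
    tokens=[3, 9, 1, 4, 2, 7], segment_len=2, initial_memory=[0]
)
```

In this run the value 9 from the first segment is still available after the last segment, even though no single call ever sees the whole sequence. In the actual architecture the read and write behavior is learned by training the Transformer rather than hand-coded.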
