Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mikhail Burtsev

Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

Feb 18, 2025

Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev

Abstract:A range of recent works addresses the problem of compression of sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or key-value cache. These approaches allow to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights two orders of magnitude gap between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.

Via

Access Paper or Ask Questions

Learning Elementary Cellular Automata with Transformers

Dec 02, 2024

Mikhail Burtsev

Abstract:Large Language Models demonstrate remarkable mathematical capabilities but at the same time struggle with abstract reasoning and planning. In this study, we explore whether Transformers can learn to abstract and generalize the rules governing Elementary Cellular Automata. By training Transformers on state sequences generated with random initial conditions and local rules, we show that they can generalize across different Boolean functions of fixed arity, effectively abstracting the underlying rules. While the models achieve high accuracy in next-state prediction, their performance declines sharply in multi-step planning tasks without intermediate context. Our analysis reveals that including future states or rule prediction in the training loss enhances the models' ability to form internal representations of the rules, leading to improved performance in longer planning horizons and autoregressive generation. Furthermore, we confirm that increasing the model's depth plays a crucial role in extended sequential computations required for complex reasoning tasks. This highlights the potential to improve LLM with inclusion of longer horizons in loss function, as well as incorporating recurrence and adaptive computation time for dynamic control of model depth.

Via

Access Paper or Ask Questions

Associative Recurrent Memory Transformer

Jul 05, 2024

Ivan Rodkin, Yuri Kuratov, Aydar Bulatov, Mikhail Burtsev

Figure 1 for Associative Recurrent Memory Transformer

Figure 2 for Associative Recurrent Memory Transformer

Figure 3 for Associative Recurrent Memory Transformer

Figure 4 for Associative Recurrent Memory Transformer

Abstract:This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task specific information distributed over a long context. We demonstrate that ARMT outperfors existing alternatives in associative retrieval tasks and sets a new performance record in the recent BABILong multi-task long-context benchmark by answering single-fact questions over 50 million tokens with an accuracy of 79.9%. The source code for training and evaluation is available on github.

* ICML 2024 Next Generation of Sequence Modeling Architectures Workshop

Via

Access Paper or Ask Questions

AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents

Jul 05, 2024

Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Mikhail Burtsev, Evgeny Burnaev

Abstract:Advancements in generative AI have broadened the potential applications of Large Language Models (LLMs) in the development of autonomous agents. Achieving true autonomy requires accumulating and updating knowledge gained from interactions with the environment and effectively utilizing it. Current LLM-based approaches leverage past experiences using a full history of observations, summarization or retrieval augmentation. However, these unstructured memory representations do not facilitate the reasoning and planning essential for complex decision-making. In our study, we introduce AriGraph, a novel method wherein the agent constructs a memory graph that integrates semantic and episodic memories while exploring the environment. This graph structure facilitates efficient associative retrieval of interconnected concepts, relevant to the agent's current state and goals, thus serving as an effective environmental model that enhances the agent's exploratory and planning capabilities. We demonstrate that our Ariadne LLM agent, equipped with this proposed memory architecture augmented with planning and decision-making, effectively handles complex tasks on a zero-shot basis in the TextWorld environment. Our approach markedly outperforms established methods such as full-history, summarization, and Retrieval-Augmented Generation in various tasks, including the cooking challenge from the First TextWorld Problems competition and novel tasks like house cleaning and puzzle Treasure Hunting.

* Code for this work is avaliable at https://github.com/AIRI-Institute/AriGraph

Via

Access Paper or Ask Questions

Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task

Jun 20, 2024

Alsu Sagirova, Mikhail Burtsev

Figure 1 for Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task

Figure 2 for Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task

Figure 3 for Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task

Figure 4 for Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task

Abstract:Even though Transformers are extensively used for Natural Language Processing tasks, especially for machine translation, they lack an explicit memory to store key concepts of processed texts. This paper explores the properties of the content of symbolic working memory added to the Transformer model decoder. Such working memory enhances the quality of model predictions in machine translation task and works as a neural-symbolic representation of information that is important for the model to make correct translations. The study of memory content revealed that translated text keywords are stored in the working memory, pointing to the relevance of memory content to the processed text. Also, the diversity of tokens and parts of speech stored in memory correlates with the complexity of the corpora for machine translation task.

* Cognitive Systems Research, Volume 75, 2022, Pages 16-24, ISSN 1389-0417
* 18 pages, 6 figures. Published in the journal Cognitive Systems Research 3 June 2022: https://www.sciencedirect.com/science/article/abs/pii/S1389041722000274

Via

Access Paper or Ask Questions

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Jun 14, 2024

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

Figure 1 for BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Figure 2 for BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Figure 3 for BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Figure 4 for BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Abstract:In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20\% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60\% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.

Via

Access Paper or Ask Questions

In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

Feb 21, 2024

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

Figure 1 for In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

Figure 2 for In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

Figure 3 for In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

Figure 4 for In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

Abstract:This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for sequences up to $10^4$ elements. In contrast, fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to $11\times 10^6$ elements. This achievement marks a substantial leap, as it is by far the longest input processed by any neural network model to date, demonstrating a significant improvement in the processing capabilities for long sequences.

* 11M tokens, fix qa3 min facts per task in Table 1

Via

Access Paper or Ask Questions

Uncertainty Guided Global Memory Improves Multi-Hop Question Answering

Nov 29, 2023

Alsu Sagirova, Mikhail Burtsev

Figure 1 for Uncertainty Guided Global Memory Improves Multi-Hop Question Answering

Figure 2 for Uncertainty Guided Global Memory Improves Multi-Hop Question Answering

Figure 3 for Uncertainty Guided Global Memory Improves Multi-Hop Question Answering

Figure 4 for Uncertainty Guided Global Memory Improves Multi-Hop Question Answering

Abstract:Transformers have become the gold standard for many natural language processing tasks and, in particular, for multi-hop question answering (MHQA). This task includes processing a long document and reasoning over the multiple parts of it. The landscape of MHQA approaches can be classified into two primary categories. The first group focuses on extracting supporting evidence, thereby constraining the QA model's context to predicted facts. Conversely, the second group relies on the attention mechanism of the long input encoding model to facilitate multi-hop reasoning. However, attention-based token representations lack explicit global contextual information to connect reasoning steps. To address these issues, we propose GEMFormer, a two-stage method that first collects relevant information over the entire document to the memory and then combines it with local context to solve the task. Our experimental results show that fine-tuning a pre-trained model with memory-augmented input, including the most certain global elements, improves the model's performance on three MHQA datasets compared to the baseline. We also found that the global explicit memory contains information from supporting facts required for the correct answer.

* 12 pages, 7 figures. EMNLP 2023. Our code is available at https://github.com/Aloriosa/GEMFormer

Via

Access Paper or Ask Questions

Better Together: Enhancing Generative Knowledge Graph Completion with Language Models and Neighborhood Information

Nov 02, 2023

Alla Chepurova, Aydar Bulatov, Yuri Kuratov, Mikhail Burtsev

Abstract:Real-world Knowledge Graphs (KGs) often suffer from incompleteness, which limits their potential performance. Knowledge Graph Completion (KGC) techniques aim to address this issue. However, traditional KGC methods are computationally intensive and impractical for large-scale KGs, necessitating the learning of dense node embeddings and computing pairwise distances. Generative transformer-based language models (e.g., T5 and recent KGT5) offer a promising solution as they can predict the tail nodes directly. In this study, we propose to include node neighborhoods as additional information to improve KGC methods based on language models. We examine the effects of this imputation and show that, on both inductive and transductive Wikidata subsets, our method outperforms KGT5 and conventional KGC approaches. We also provide an extensive analysis of the impact of neighborhood on model prediction and show its importance. Furthermore, we point the way to significantly improve KGC through more effective neighborhood selection.

* Accepted to Findings of the Association for Computational Linguistics: EMNLP 2023

Via

Access Paper or Ask Questions

Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

Jul 04, 2023

Dmitry Karpov, Mikhail Burtsev

Abstract:This article investigates the knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large sample number (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We have prepared this dataset from the "Yandex Que" raw data. By evaluating the RuQTopics - trained models on the six matching classes of the Russian MASSIVE subset, we have proved that the RuQTopics dataset is suitable for real-world conversational tasks, as the Russian-only models trained on this dataset consistently yield an accuracy around 85\% on this subset. We also have figured out that for the multilingual BERT, trained on the RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy closely correlates (Spearman correlation 0.773 with p-value 2.997e-11) with the approximate size of the pretraining BERT's data for the corresponding language. At the same time, the correlation of the language-wise accuracy with the linguistical distance from Russian is not statistically significant.

Via

Access Paper or Ask Questions