Alexander Podolskiy

Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

Dec 25, 2025

Alexander Podolskiy, Semen Molokov, Timofey Gerasin, Maksim Titov, Alexey Rukhovich, Artem Khrapov, Kirill Morozov, Evgeny Tetin, Constantine Korikov, Pavel Efimov(+7 more)

Abstract:We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks, and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including the MERA benchmark, among the models of comparable size (1-2B parameters).

Commute Your Domains: Trajectory Optimality Criterion for Multi-Domain Learning

Jan 26, 2025

Alexey Rukhovich, Alexander Podolskiy, Irina Piontkovskaya

Abstract:In multi-domain learning, a single model is trained on diverse data domains to leverage shared knowledge and improve generalization. The order in which the data from these domains is used for training can significantly affect the model's performance on each domain. However, this dependence is under-studied. In this paper, we investigate the influence of training order (or data mixing) in multi-domain learning using the concept of the Lie bracket of gradient vector fields. By analyzing the infinitesimal effects of changing the training order, we identify regions in the parameter space where altering the order between two training domains can benefit the target loss. We validate the predictions of our theoretical framework on the influence of training order (or data mixing) on both a toy example and bilingual LLM pre-training.

* NeurIPS 2024 Workshop on Mathematics of Modern Machine Learning
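
The order dependence described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's code: it assumes two hypothetical quadratic "domain" losses and plain gradient descent, and checks that swapping two consecutive steps shifts the parameters by the learning rate squared times the commutator of the two gradient fields (the Lie bracket, up to sign convention). All names and numbers are illustrative.

```python
# Toy setting (an assumption, not the paper's setup): each domain k has a
# quadratic loss L_k(theta) = 0.5 * theta^T H_k theta - b_k^T theta,
# so its gradient field is g_k(theta) = H_k theta - b_k and its Hessian is H_k.

def matvec(H, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def grad(H, b, theta):
    """Gradient of the quadratic loss at theta."""
    return [hv - bi for hv, bi in zip(matvec(H, theta), b)]

def gd_step(H, b, theta, lr):
    """One plain gradient-descent step on a single domain."""
    return [t - lr * g for t, g in zip(theta, grad(H, b, theta))]

def order_gap(HA, bA, HB, bB, theta, lr):
    """Parameters after (A then B) minus parameters after (B then A)."""
    ab = gd_step(HB, bB, gd_step(HA, bA, theta, lr), lr)
    ba = gd_step(HA, bA, gd_step(HB, bB, theta, lr), lr)
    return [x - y for x, y in zip(ab, ba)]

def commutator(HA, bA, HB, bB, theta):
    """H_B g_A - H_A g_B: the commutator of the two gradient fields at theta."""
    gA, gB = grad(HA, bA, theta), grad(HB, bB, theta)
    return [x - y for x, y in zip(matvec(HB, gA), matvec(HA, gB))]

# Hypothetical domains and starting point.
HA, bA = [[2.0, 0.0], [0.0, 1.0]], [1.0, 0.0]
HB, bB = [[1.0, 0.5], [0.5, 3.0]], [0.0, 1.0]
theta0, lr = [0.3, -0.2], 0.1

gap = order_gap(HA, bA, HB, bB, theta0, lr)
predicted = [lr ** 2 * c for c in commutator(HA, bA, HB, bB, theta0)]
print(gap, predicted)
```

For quadratic losses the second-order expansion is exact, so the two printed vectors agree up to floating-point error; for general losses the match would hold only to leading order in the learning rate.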

Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule

Nov 20, 2023

Andrey Bout, Alexander Podolskiy, Sergey Nikolenko, Irina Piontkovskaya

Abstract:Progress in neural grammatical error correction (GEC) is hindered by the lack of annotated training data. Sufficient amounts of high-quality manually annotated data are not available, so recent research has relied on generating synthetic data, pretraining on it, and then fine-tuning on real datasets; performance gains have been achieved either by ensembling or by using huge pretrained models such as XXL-T5 as the backbone. In this work, we explore an orthogonal direction: how to use available data more efficiently. First, we propose auxiliary tasks that exploit the alignment between the original and corrected sentences, such as predicting a sequence of corrections. We formulate each task as a sequence-to-sequence problem and perform multi-task training. Second, we discover that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance, so we set out to find the best training schedule. Together, these two ideas lead to significant improvements, producing results that improve state of the art with much smaller models; in particular, we outperform the best models based on T5-XXL (11B parameters) with a BART-based model (400M parameters).

* EMNLP 2023
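
As a rough illustration of the multi-task idea in the abstract, the sketch below turns one (original, corrected) sentence pair into two sequence-to-sequence examples: the main correction task and an auxiliary task that predicts the sequence of corrections. The task prefixes, the edit string format, and the use of Python's `difflib` for token alignment are assumptions made for this example, not the authors' actual formulation.

```python
import difflib

def edit_sequence(source, target):
    """Describe the token-level corrections that turn source into target."""
    src, tgt = source.split(), target.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged spans carry no correction
        ops.append(f"{tag} [{' '.join(src[i1:i2])}] -> [{' '.join(tgt[j1:j2])}]")
    return " ; ".join(ops)

def multitask_examples(source, target):
    """Build (input, output) pairs for the main task and one auxiliary task."""
    return [
        ("correct: " + source, target),                       # main GEC task
        ("edits: " + source, edit_sequence(source, target)),  # auxiliary task
    ]

examples = multitask_examples(
    "She go to school yesterday", "She went to school yesterday"
)
for model_input, model_output in examples:
    print(model_input, "=>", model_output)
```

Both examples share the same input sentence, so a single seq2seq model can be trained on their union, with the prefix telling it which output to produce.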

GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding

Nov 14, 2023

Konstantin Yakovlev, Alexander Podolskiy, Andrey Bout, Sergey Nikolenko, Irina Piontkovskaya

Abstract:Grammatical error correction (GEC) is an important NLP task that is currently usually solved with autoregressive sequence-to-sequence models. However, approaches of this class are inherently slow due to one-by-one token generation, so non-autoregressive alternatives are needed. In this work, we propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network that outputs a self-attention weight matrix that can be used in beam search to find the best permutation of input tokens (with auxiliary {ins} tokens) and a decoder network based on a step-unrolled denoising autoencoder that fills in specific tokens. This allows us to find the token permutation after only one forward pass of the permutation network, avoiding autoregressive constructions. We show that the resulting network improves over previously known non-autoregressive methods for GEC and reaches the level of autoregressive methods that do not use language-specific synthetic data generation methods. Our results are supported by a comprehensive experimental validation on the CoNLL-2014 and Write&Improve+LOCNESS datasets and an extensive ablation study that supports our architectural and algorithmic choices.

* ACL 2023

Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval

Nov 14, 2023

Konstantin Yakovlev, Gregory Polyakov, Ilseyar Alimova, Alexander Podolskiy, Andrey Bout, Sergey Nikolenko, Irina Piontkovskaya

Abstract:A recent trend in multimodal retrieval is related to postprocessing test set results via the dual-softmax loss (DSL). While this approach can bring significant improvements, it usually presumes that an entire matrix of test samples is available as DSL input. This work introduces a new postprocessing approach based on Sinkhorn transformations that outperforms DSL. Further, we propose a new postprocessing setting that does not require access to multiple test queries. We show that our approach can significantly improve the results of state-of-the-art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus achieving a new state of the art on several standard text-video retrieval datasets both with access to the entire test set and in the single-query setting.

* SIGIR 2023
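
To make the postprocessing idea concrete, here is a minimal sketch of the generic Sinkhorn-Knopp iteration applied to a query-by-video similarity matrix. It is an illustration under stated assumptions, not the paper's exact transformation: the temperature, the iteration count, and the toy similarity values are hypothetical, and the single-query setting from the abstract is not covered.

```python
import math

def sinkhorn(sim, temperature=0.5, n_iters=100):
    """Alternately normalize rows and columns of exp(sim / temperature).

    For a square matrix with positive entries the result approaches a doubly
    stochastic matrix: every query spreads one unit of mass over the videos,
    and every video receives one unit of mass from the queries.
    """
    K = [[math.exp(s / temperature) for s in row] for row in sim]
    for _ in range(n_iters):
        K = [[v / sum(row) for v in row] for row in K]             # rows sum to 1
        col_sums = [sum(col) for col in zip(*K)]
        K = [[v / c for v, c in zip(row, col_sums)] for row in K]  # columns sum to 1
    return K

# Hypothetical similarities: rows are text queries, columns are videos.
# Video 0 acts as a "hub" that scores highly for queries 0 and 1.
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.30, 0.20],
    [0.20, 0.40, 0.70],
]

rescored = sinkhorn(sim)
for row in rescored:
    print([round(v, 3) for v in row])
```

Ranking videos by the rescored rows instead of the raw similarities penalizes videos that are close to many queries at once, which is the same intuition that motivates DSL.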

PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing

Mar 20, 2023

Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov(+7 more)

Abstract:The scaling of large language models has greatly improved natural language understanding, generation, and reasoning. In this work, we develop a system that trained a trillion-parameter language model on a cluster of Ascend 910 AI processors with the MindSpore framework, and present the resulting language model with 1.085T parameters, named PanGu-Σ. With parameters inherited from PanGu-α, we extend the dense Transformer model to a sparse one with Random Routed Experts (RRE), and efficiently train the model over 329B tokens by using Expert Computation and Storage Separation (ECSS). This resulted in a 6.3x increase in training throughput through heterogeneous computing. Our experimental findings show that PanGu-Σ provides state-of-the-art performance in zero-shot learning of various Chinese NLP downstream tasks. Moreover, it demonstrates strong abilities when fine-tuned on application data for open-domain dialogue, question answering, machine translation and code generation.

Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution

Jun 07, 2022

Nikolay Arefyev, Boris Sheludko, Alexander Podolskiy, Alexander Panchenko

Abstract:Lexical substitution, i.e., the generation of plausible words that can replace a particular target word in a given context, is an extremely powerful technology that can be used as a backbone of various NLP applications, including word sense induction and disambiguation, lexical relation extraction, data augmentation, etc. In this paper, we present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models and masked language models (LMs and MLMs), such as context2vec, ELMo, BERT, RoBERTa, XLNet. We show that the already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly. Several existing and new target word injection methods are compared for each LM/MLM using both intrinsic evaluation on lexical substitution datasets and extrinsic evaluation on word sense induction (WSI) datasets. On two WSI datasets we obtain new SOTA results. In addition, we analyze the types of semantic relations between target words and their substitutes generated by different models or given by annotators.

* Proceedings of the 28th International Conference on Computational Linguistics, pages 1242-1255, Barcelona, Spain (Online). International Committee on Computational Linguistics. 2022
* arXiv admin note: text overlap with arXiv:2006.00031

Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection

Jan 11, 2021

Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, Irina Piontkovskaya

Abstract:Real-life applications that rely heavily on machine learning, such as dialog systems, demand out-of-domain detection methods. Intent classification models should be equipped with a mechanism to distinguish seen intents from unseen ones so that the dialog agent is capable of rejecting the latter and avoiding undesired behavior. However, despite increasing attention paid to the task, the best practices for out-of-domain intent detection have not yet been fully established. This paper conducts a thorough comparison of out-of-domain intent detection methods. We prioritize methods that do not require access to out-of-domain data during training, since gathering such data is extremely time- and labor-consuming due to the lexical and stylistic variation of user utterances. We evaluate multiple contextual encoders and methods that have proven to be efficient on three standard datasets for intent classification, expanded with out-of-domain utterances. Our main findings show that fine-tuning Transformer-based encoders on in-domain data leads to superior results. Mahalanobis distance, together with utterance representations derived from Transformer-based encoders, outperforms other methods by a wide margin and establishes new state-of-the-art results for all datasets. The broader analysis shows that the reason for this success lies in the fact that the fine-tuned Transformer is capable of constructing homogeneous representations of in-domain utterances, revealing geometrical disparity to out-of-domain utterances. In turn, the Mahalanobis distance captures this disparity easily.

* to appear in AAAI 2021
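
The scoring rule named in the abstract can be sketched in a few lines. The example below is a simplified illustration, not the paper's implementation: it assumes 2-D feature vectors standing in for Transformer utterance embeddings, per-intent means with one covariance matrix shared across intents, and an out-of-domain score equal to the smallest squared Mahalanobis distance to any intent mean. Intent names and data points are hypothetical.

```python
def class_mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fit(features_by_intent):
    """Estimate per-intent means and the inverse of a shared 2x2 covariance."""
    means = {c: class_mean(pts) for c, pts in features_by_intent.items()}
    sxx = sxy = syy = 0.0
    n = 0
    for c, pts in features_by_intent.items():
        mx, my = means[c]
        for x, y in pts:
            dx, dy = x - mx, y - my
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
            n += 1
    sxx, sxy, syy = sxx / n, sxy / n, syy / n
    det = sxx * syy - sxy * sxy  # must be non-zero for the inverse to exist
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return means, inv

def ood_score(x, means, inv):
    """Smallest squared Mahalanobis distance from x to any intent mean."""
    best = float("inf")
    for mx, my in means.values():
        dx, dy = x[0] - mx, x[1] - my
        d = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
        best = min(best, d)
    return best

# Hypothetical in-domain training features for two intents.
train = {
    "weather": [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)],
    "music": [(4.0, 4.1), (4.2, 3.9), (3.9, 4.0), (4.1, 4.2)],
}
means, inv = fit(train)

in_domain_score = ood_score((1.0, 1.1), means, inv)
out_of_domain_score = ood_score((2.5, 0.0), means, inv)
print(in_domain_score, out_of_domain_score)
```

An utterance would be rejected as out-of-domain when its score exceeds a threshold chosen on held-out data; no out-of-domain examples are needed to fit the means or the covariance.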

A Comparative Study of Lexical Substitution Approaches based on Neural Language Models

May 29, 2020

Nikolay Arefyev, Boris Sheludko, Alexander Podolskiy, Alexander Panchenko

Abstract:Lexical substitution in context is an extremely powerful technology that can be used as a backbone of various NLP applications, such as word sense induction, lexical relation extraction, data augmentation, etc. In this paper, we present a large-scale comparative study of popular neural language and masked language models (LMs and MLMs), such as context2vec, ELMo, BERT, XLNet, applied to the task of lexical substitution. We show that the already competitive results achieved by SOTA LMs/MLMs can be further improved if information about the target word is injected properly, and compare several target injection methods. In addition, we provide an analysis of the types of semantic relations between the target and the substitutes generated by different models, providing insight into what kinds of words are actually generated or given by annotators as substitutes.
