Anna Aksenova

MiniLingua: A Small Open-Source LLM for European Languages

Dec 15, 2025

Anna Aksenova, Boris Zverkov, Nicola Dainese, Alexander Nikitin, Pekka Marttinen

Abstract: Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release model weights, tokenizer and source code used for data processing and model training.

* 9+6 pages, 6 figures and 3 tables in the main text. Code at https://github.com/MiniLingua-ai/training_artifacts

RuDSI: graph-based word sense induction dataset for Russian

Sep 28, 2022

Anna Aksenova, Ekaterina Gavrishina, Elisey Rykov, Andrey Kutuzov

Abstract: We present RuDSI, a new benchmark for word sense induction (WSI) in Russian. The dataset was created using manual annotation and semi-automatic clustering of Word Usage Graphs (WUGs). Unlike prior WSI datasets for Russian, RuDSI is completely data-driven (based on texts from the Russian National Corpus), with no external word senses imposed on annotators. Depending on the parameters of graph clustering, different derivative datasets can be produced from the raw annotation. We report the performance that several baseline WSI methods obtain on RuDSI and discuss possibilities for improving these scores.

* TextGraphs-16 workshop at the CoLING-2022 conference
