Abstract:This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. Our best model, named \textsc{Llammas}, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonia. These contributions mark the initial progress in the direction of developing open-source LLMs for Estonian.
Abstract:This study explores enhancing grammatical error correction (GEC) through artificial error generation (AEG) using language models (LMs). Specifically, we fine-tune Llama 2-based LMs for error generation and find that this approach yields synthetic errors akin to human errors. Next, we train GEC Llama models with the help of these artificial errors and outperform previous state-of-the-art error correction models, with gains ranging between 0.8 and 6 F0.5 points across all tested languages (German, Ukrainian, and Estonian). Moreover, we demonstrate that generating errors by fine-tuning smaller sequence-to-sequence models and prompting large commercial LMs (GPT-3.5 and GPT-4) also results in synthetic errors beneficially affecting error generation models.
Abstract:The project was funded in 2021-2023 by the National Programme of Estonian Language Technology. Its main aim was to develop spelling and grammar correction tools for the Estonian language. The main challenge was the very small amount of available error correction data needed for such development. To mitigate this, (1) we annotated more correction data for model training and testing, (2) we tested transfer-learning, i.e. retraining machine learning models created for other tasks, so as not to depend solely on correction data, (3) we compared the developed method and model with alternatives, including large language models. We also developed automatic evaluation, which can calculate the accuracy and yield of corrections by error category, so that the effectiveness of different methods can be compared in detail. There has been a breakthrough in large language models during the project: GPT4, a commercial language model with Estonian-language support, has been created. We took into account the existence of the model when adjusting plans and in the report we present a comparison with the ability of GPT4 to improve the Estonian language text. The final results show that the approach we have developed provides better scores than GPT4 and the result is usable but not entirely reliable yet. The report also contains ideas on how GPT4 and other major language models can be implemented in the future, focusing on open-source solutions. All results of this project are open-data/open-source, with licenses that allow them to be used for purposes including commercial ones.
Abstract:Both grammatical error correction and text style transfer can be viewed as monolingual sequence-to-sequence transformation tasks, but the scarcity of directly annotated data for either task makes them unfeasible for most languages. We present an approach that does both tasks within the same trained model, and only uses regular language parallel data, without requiring error-corrected or style-adapted texts. We apply our model to three languages and present a thorough evaluation on both tasks, showing that the model is reliable for a number of error types and style transfer aspects.