Agnes Luhtaru

Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer

Apr 05, 2024

Hele-Andra Kuulmets, Taido Purason, Agnes Luhtaru, Mark Fishel

Abstract: This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. Our best model, named Llammas, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonian. These contributions mark initial progress toward developing open-source LLMs for Estonian.

To Err Is Human, but Llamas Can Learn It Too

Mar 08, 2024

Agnes Luhtaru, Taido Purason, Martin Vainikko, Maksym Del, Mark Fishel

Abstract: This study explores enhancing grammatical error correction (GEC) through artificial error generation (AEG) using language models (LMs). Specifically, we fine-tune Llama 2-based LMs for error generation and find that this approach yields synthetic errors akin to human errors. Next, we train GEC Llama models with the help of these artificial errors and outperform previous state-of-the-art error correction models, with gains ranging between 0.8 and 6 F0.5 points across all tested languages (German, Ukrainian, and Estonian). Moreover, we demonstrate that generating errors by fine-tuning smaller sequence-to-sequence models and prompting large commercial LMs (GPT-3.5 and GPT-4) also results in synthetic errors beneficially affecting error generation models.

Autocorrect for Estonian texts: final report from project EKTB25

Feb 18, 2024

Agnes Luhtaru, Martin Vainikko, Krista Liin, Kais Allkivi-Metsoja, Jaagup Kippar, Pille Eslon, Mark Fishel

Abstract: The project was funded in 2021-2023 by the National Programme of Estonian Language Technology. Its main aim was to develop spelling and grammar correction tools for the Estonian language. The main challenge was the very small amount of available error correction data needed for such development. To mitigate this, (1) we annotated more correction data for model training and testing, (2) we tested transfer learning, i.e. retraining machine learning models created for other tasks, so as not to depend solely on correction data, and (3) we compared the developed method and model with alternatives, including large language models. We also developed automatic evaluation, which can calculate the accuracy and yield of corrections by error category, so that the effectiveness of different methods can be compared in detail. A breakthrough in large language models occurred during the project: GPT-4, a commercial language model with Estonian-language support, was released. We took the existence of this model into account when adjusting our plans, and in the report we present a comparison with GPT-4's ability to improve Estonian-language text. The final results show that the approach we have developed provides better scores than GPT-4, and that the result is usable but not yet entirely reliable. The report also contains ideas on how GPT-4 and other major language models can be used in the future, focusing on open-source solutions. All results of this project are open-data/open-source, with licenses that allow them to be used for purposes including commercial ones.

* in Estonian language

Grammatical Error Correction and Style Transfer via Zero-shot Monolingual Translation

Mar 27, 2019

Elizaveta Korotkova, Agnes Luhtaru, Maksym Del, Krista Liin, Daiga Deksne, Mark Fishel

Abstract: Both grammatical error correction and text style transfer can be viewed as monolingual sequence-to-sequence transformation tasks, but the scarcity of directly annotated data for either task makes them infeasible for most languages. We present an approach that performs both tasks within the same trained model and uses only regular language parallel data, without requiring error-corrected or style-adapted texts. We apply our model to three languages and present a thorough evaluation on both tasks, showing that the model is reliable for a number of error types and style transfer aspects.
