Ariel Ekgren

SWEb: A Large Web Dataset for the Scandinavian Languages

Oct 06, 2024

Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul Dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren

Abstract: This paper presents the hitherto largest pretraining dataset for the Scandinavian languages: the Scandinavian WEb (SWEb), comprising over one trillion tokens. The paper details the collection and processing pipeline, and introduces a novel model-based text extractor that significantly reduces complexity in comparison with rule-based approaches. We also introduce a new cloze-style benchmark for evaluating language models in Swedish, and use this test to compare models trained on the SWEb data to models trained on FineWeb, with competitive results. All data, models and code are shared openly.

GPT-SW3: An Autoregressive Language Model for the Nordic Languages

May 23, 2023

Ariel Ekgren, Amaru Cuba Gyllensten, Felix Stollenwerk, Joey Öhman, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Alice Heiman, Judit Casademont, Magnus Sahlgren

Abstract: This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, training configuration and instruction finetuning, to evaluation and considerations for release strategies. We hope that this paper can serve as a guide and reference for other researchers that undertake the development of large generative models for smaller languages.

The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling

Mar 30, 2023

Joey Öhman, Severine Verlinden, Ariel Ekgren, Amaru Cuba Gyllensten, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Magnus Sahlgren

Abstract: Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of the LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as the Nordic ones, where the availability of text corpora is limited. In order to facilitate the development of LLMs in the Nordic languages, we curate a high-quality dataset consisting of 1.2TB of text, in all of the major North Germanic languages (Danish, Icelandic, Norwegian, and Swedish), as well as some high-quality English data. This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.

Cross-lingual Transfer of Monolingual Models

Sep 15, 2021

Evangelia Gogoulou, Ariel Ekgren, Tim Isbister, Magnus Sahlgren

Abstract: Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the effects of such transfer from four different languages to English. Our experimental results on GLUE show that the transferred models outperform the native English model independently of the source language. After probing the English linguistic knowledge encoded in the representations before and after transfer, we find that semantic information is retained from the source language, while syntactic information is learned during transfer. Additionally, the results of evaluating the transferred models in source language tasks reveal that their performance in the source domain deteriorates after transfer.

R-grams: Unsupervised Learning of Semantic Units in Natural Language

Aug 14, 2018

Ariel Ekgren, Amaru Cuba Gyllensten, Magnus Sahlgren

Abstract: This paper introduces a novel type of data-driven segmented unit that we call r-grams. We illustrate one algorithm for calculating r-grams, and discuss its properties and impact on the frequency distribution of text representations. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its viability as a language-invariant segmentation procedure.
