Marcel Bollmann

How Good is Your Wikipedia?

Nov 08, 2024

Kushal Tatariya, Artur Kulmizev, Wessel Poelman, Esther Ploeger, Marcel Bollmann, Johannes Bjerva, Jiaming Luo, Heather Lent, Miryam de Lhoneux

Abstract: Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
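The abstract names two of the issues that quality filtering surfaces: one-line articles and duplicate articles. As a rough illustration only, the sketch below drops both from a list of `(title, text)` pairs. The function names, the "at most one non-empty line" rule, and the hash-based exact-duplicate check are assumptions made for this example; the abstract does not specify the filters the paper actually uses.

```python
import hashlib


def is_one_line(text: str) -> bool:
    """Return True if the article has at most one non-empty line (assumed rule)."""
    return sum(1 for line in text.splitlines() if line.strip()) <= 1


def filter_articles(articles):
    """Drop one-line articles and exact duplicates.

    Duplicates are detected by hashing the text after lowercasing and
    collapsing whitespace. This is an illustrative choice, not the paper's method.
    """
    seen = set()
    kept = []
    for title, text in articles:
        if is_one_line(text):
            continue
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((title, text))
    return kept


# Hypothetical toy data, not taken from any Wikipedia dump.
articles = [
    ("A", "Line one.\nLine two."),
    ("B", "Only one line."),
    ("C", "line one.\n  Line   two."),
    ("D", "First line.\nSecond line."),
]
kept_titles = [title for title, _ in filter_articles(articles)]
```

Here article B is removed as a one-line article and C as a duplicate of A once case and whitespace are normalised, so only A and D remain.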

CreoleVal: Multilingual Multitask Benchmarks for Creoles

Oct 30, 2023

Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Hans Erik Heje, Diptesh Kanojia, Paul Belony(+7 more)

Abstract: Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and other highly-resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of brand new development datasets for machine comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, the goal of CreoleVal is to empower research on Creoles in NLP and computational linguistics. We hope this resource will contribute to technological inclusion for Creole language users around the globe.

A Large-Scale Comparison of Historical Text Normalization Systems

Apr 03, 2019

Marcel Bollmann

Abstract: There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different datasets, different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature and report experiments on eight languages, comparing systems spanning all categories of proposed normalization techniques, analysing the effect of training data quantity, and using different evaluation methods. The datasets and scripts are made publicly available.

* Accepted at NAACL 2019

Few-Shot and Zero-Shot Learning for Historical Text Normalization

Mar 12, 2019

Marcel Bollmann, Natalia Korchagina, Anders Søgaard

Abstract: Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can sometimes lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of multi-task learning strategies across different datasets from different languages. This paper evaluates 63 multi-task learning strategies for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. Finally, we show that zero-shot learning outperforms the simple, but relatively strong, identity baseline.
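The identity baseline mentioned in the abstract can be pictured as leaving every historical token unchanged and scoring the result against the modern reference. The sketch below is a minimal illustration under that reading: the function names, the word-level accuracy metric, and the toy tokens are assumptions for this example, not the paper's evaluation code or data.

```python
def identity_baseline(historical_tokens):
    """Predict each historical token unchanged as its 'normalization'."""
    return list(historical_tokens)


def word_accuracy(predictions, references):
    """Fraction of tokens whose prediction exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical toy example; tokens are invented for illustration.
historical = ["the", "kynge", "sayde", "to", "hym"]
modern = ["the", "king", "said", "to", "him"]

baseline_accuracy = word_accuracy(identity_baseline(historical), modern)
```

In this toy case two of the five tokens already match their modern form, which suggests why a baseline that changes nothing can be relatively strong when many historical spellings coincide with modern ones.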

Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

Oct 25, 2016

Marcel Bollmann, Anders Søgaard

Abstract: Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previously established normalization algorithms when evaluated on a diverse set of texts from Early New High German. We show that multi-task learning with additional normalization data can improve our model's performance further.

* Accepted to COLING 2016
