Erik Henriksson

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Mar 13, 2025
Via arXiv

FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

Jan 13, 2025
Via arXiv

Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers

Jun 28, 2024
Via arXiv