Hynek Kydlíček

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Jun 26, 2025

Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

Feb 18, 2025

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Feb 04, 2025

Towards Best Practices for Open Datasets for LLM Training

Jan 14, 2025

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Jun 25, 2024

A Dataset and Strong Baselines for Classification of Czech News Texts

Jul 20, 2023