Picture for Guilherme Penedo

Guilherme Penedo

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Add code
Feb 04, 2025
Viaarxiv icon

Towards Best Practices for Open Datasets for LLM Training

Add code
Jan 14, 2025
Viaarxiv icon

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Add code
Jun 25, 2024
Viaarxiv icon

The Falcon Series of Open Language Models

Add code
Nov 29, 2023
Figure 1 for The Falcon Series of Open Language Models
Figure 2 for The Falcon Series of Open Language Models
Figure 3 for The Falcon Series of Open Language Models
Figure 4 for The Falcon Series of Open Language Models
Viaarxiv icon

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Add code
Jun 01, 2023
Figure 1 for The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Figure 2 for The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Figure 3 for The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Figure 4 for The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Viaarxiv icon