Thomas Wolf

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Feb 04, 2025
Via arXiv

Towards Best Practices for Open Datasets for LLM Training

Jan 14, 2025
Via arXiv

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Jun 25, 2024
Via arXiv

StarCoder 2 and The Stack v2: The Next Generation

Feb 29, 2024
Via arXiv

GAIA: a benchmark for General AI Assistants

Nov 21, 2023
Via arXiv

FinGPT: Large Generative Models for a Small Language

Nov 03, 2023
Via arXiv

Zephyr: Direct Distillation of LM Alignment

Oct 25, 2023
Via arXiv

Scaling Data-Constrained Language Models

May 25, 2023
Via arXiv

StarCoder: may the source be with you!

May 09, 2023
Via arXiv

Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning

Feb 06, 2023
Via arXiv