Paulo Villegas

StarCoder: may the source be with you!

May 09, 2023
Via arXiv

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Mar 07, 2023
Via arXiv

The ROOTS Search Tool: Data Transparency for LLMs

Feb 27, 2023
Via arXiv

SantaCoder: don't reach for the stars!

Jan 09, 2023
Via arXiv

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Nov 09, 2022
Via arXiv

BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling

Jul 14, 2022
Via arXiv