Pedro Ortiz Suarez

Tokenizer Choice For LLM Training: Negligible or Crucial?

Oct 18, 2023

Semi-automatic staging area for high-quality structured data extraction from scientific literature

Sep 19, 2023

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Mar 07, 2023

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

Dec 20, 2022

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Nov 09, 2022

Automatic Extraction of Materials and Properties from Superconductors Scientific Literature

Oct 26, 2022

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

Feb 18, 2022

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

Jan 25, 2022

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

Jan 17, 2022