Pedro Ortiz Suarez

Data Processing for the OpenGPT-X Model Family

Oct 11, 2024

Molyé: A Corpus-based Approach to Language Contact in Colonial France

Aug 08, 2024

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

Jun 13, 2024

Tokenizer Choice For LLM Training: Negligible or Crucial?

Oct 18, 2023

Semi-automatic staging area for high-quality structured data extraction from scientific literature

Sep 19, 2023

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Mar 07, 2023

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

Dec 20, 2022

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Nov 09, 2022

Automatic Extraction of Materials and Properties from Superconductors Scientific Literature

Oct 26, 2022

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

Feb 18, 2022