Pedro Ortiz Suarez

KréyoLID From Language Identification Towards Language Mining

Mar 09, 2025

Data Processing for the OpenGPT-X Model Family

Oct 11, 2024

Molyé: A Corpus-based Approach to Language Contact in Colonial France

Aug 08, 2024

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

Jun 13, 2024

Tokenizer Choice For LLM Training: Negligible or Crucial?

Oct 18, 2023

Semi-automatic staging area for high-quality structured data extraction from scientific literature

Sep 19, 2023

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Mar 07, 2023

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

Dec 20, 2022

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Nov 09, 2022

Automatic Extraction of Materials and Properties from Superconductors Scientific Literature

Oct 26, 2022