Luca Soldaini

Amazon Alexa Search

NeuCLIRTech: Chinese Monolingual and Cross-Language Information Retrieval Evaluation in a Challenging Domain

Feb 05, 2026

Bolmo: Byteifying the Next Generation of Language Models

Dec 17, 2025

Olmo 3

Dec 15, 2025

olmOCR 2: Unit Test Rewards for Document OCR

Oct 22, 2025

Overview of the TREC 2024 NeuCLIR Track

Sep 17, 2025

FlexOlmo: Open Language Models for Flexible Data Use

Jul 09, 2025

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Jun 05, 2025

Teaching Models to Understand (but not Generate) High-risk Data

May 05, 2025

DataDecide: How to Predict Best Pretraining Data with Small Experiments

Apr 15, 2025

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

Apr 09, 2025