
Martin Jaggi

EPFL

Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

Apr 22, 2026

Weight Decay may matter more than muP for Learning Rate Transfer in Practice

Oct 21, 2025

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Sep 17, 2025

TiMoE: Time-Aware Mixture of Language Experts

Aug 12, 2025

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Jun 26, 2025

Gradient-Normalized Smoothness for Optimization with Approximate Hessians

Jun 16, 2025

GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining

May 26, 2025

Towards Fully FP8 GEMM LLM Training at Scale

May 26, 2025

URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training

May 22, 2025

NeuralGrok: Accelerate Grokking by Neural Gradient Transformation

Apr 24, 2025