Moritz Hardt

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

Jun 08, 2026
Via arXiv

FutureSim: Replaying World Events to Evaluate Adaptive Agents

May 14, 2026
Via arXiv

Computational Arbitrage in AI Model Markets

Mar 23, 2026
Via arXiv

Leaderboard Incentives: Model Rankings under Strategic Post-Training

Mar 09, 2026
Via arXiv

Good Allocations from Bad Estimates

Jan 09, 2026
Via arXiv

Scaling Open-Ended Reasoning to Predict the Future

Dec 31, 2025
Via arXiv

Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning

Oct 06, 2025
Via arXiv

Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Jul 03, 2025
Via arXiv

How Benchmark Prediction from Fewer Data Misses the Mark

Jun 09, 2025
Via arXiv

Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data

Oct 17, 2024
Via arXiv