Michael Krumdick

FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

Apr 07, 2026
Via arXiv

Cost-Efficient Estimation of General Abilities Across Benchmarks

Apr 01, 2026
Via arXiv

On Finding Inconsistencies in Documents

Dec 21, 2025
Via arXiv

Complexity Scaling Laws for Neural Models using Combinatorial Optimization

Jun 15, 2025
Via arXiv

BLEUBERI: BLEU is a surprisingly effective reward for instruction following

May 16, 2025
Via arXiv

No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding

Mar 07, 2025
Via arXiv

Are Language Model Logits Calibrated?

Oct 21, 2024
Via arXiv

SEC-QA: A Systematic Evaluation Corpus for Financial QA

Jun 20, 2024
Via arXiv

An Analysis of Multilingual FActScore

Jun 20, 2024
Via arXiv

BizBench: A Quantitative Reasoning Benchmark for Business and Finance

Nov 11, 2023
Via arXiv