Brad Kenstler

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Jan 31, 2026
Via arXiv

Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Dec 16, 2025
Via arXiv

ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

Nov 10, 2025
Via arXiv

Remote Labor Index: Measuring AI Automation of Remote Work

Oct 30, 2025
Via arXiv

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

Mar 05, 2025
Via arXiv