Picture for Percy Liang

Percy Liang

Shammie

UQ: Assessing Language Models on Unsolved Questions

Add code
Aug 25, 2025
Viaarxiv icon

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Add code
Jul 03, 2025
Viaarxiv icon

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Add code
May 26, 2025
Viaarxiv icon

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Add code
May 21, 2025
Viaarxiv icon

Extracting memorized pieces of (copyrighted) books from open-weight language models

Add code
May 18, 2025
Viaarxiv icon

MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

Add code
May 12, 2025
Viaarxiv icon

Reliable and Efficient Amortized Model-based Evaluation

Add code
Mar 17, 2025
Viaarxiv icon

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Add code
Feb 27, 2025
Viaarxiv icon

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Add code
Feb 26, 2025
Figure 1 for The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Figure 2 for The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Figure 3 for The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Figure 4 for The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Viaarxiv icon

Independence Tests for Language Models

Add code
Feb 17, 2025
Viaarxiv icon