Graham Neubig

Carnegie Mellon University

OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety

Jul 08, 2025
Via arXiv

ZINA: Multimodal Fine-grained Hallucination Detection and Editing

Jun 16, 2025
Via arXiv

CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation

Jun 10, 2025
Via arXiv

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

May 26, 2025
Via arXiv

The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

May 15, 2025
Via arXiv

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Apr 15, 2025
Via arXiv

Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

Apr 10, 2025
Via arXiv

Inducing Programmatic Skills for Agentic Tasks

Apr 09, 2025
Via arXiv

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Apr 09, 2025
Via arXiv

M-Prometheus: A Suite of Open Multilingual LLM Judges

Apr 07, 2025
Via arXiv