Michal Shmueli-Scheuer

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

May 27, 2026
Via arXiv

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

May 21, 2026
Via arXiv

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

May 18, 2026
Via arXiv

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Apr 14, 2026
Via arXiv

CUBE: A Standard for Unifying Agent Benchmarks

Mar 16, 2026
Via arXiv

General Agent Evaluation

Feb 26, 2026
Via arXiv

Robustness as an Emergent Property of Task Performance

Feb 03, 2026
Via arXiv

ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Jan 22, 2026
Via arXiv

Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models

May 26, 2025
Via arXiv

Survey on Evaluation of LLM-based Agents

Mar 20, 2025
Via arXiv