Ernesto Hernandez

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Jan 31, 2026

PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Nov 14, 2025

Remote Labor Index: Measuring AI Automation of Remote Work

Oct 30, 2025

MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Jul 23, 2025

Creating and Characterizing a Diverse Corpus of Sarcasm in Dialogue

Sep 15, 2017

Learning Fine-Grained Knowledge about Contingent Relations between Everyday Events

Aug 30, 2017