José Hernández-Orallo

From Human-Level AI Tales to AI Leveling Human Scales

Feb 21, 2026
Via arXiv

AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Feb 19, 2026
Via arXiv

Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores

Jan 20, 2026
Via arXiv

11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis

Aug 27, 2025
Via arXiv

Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Jun 10, 2025
Via arXiv

Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models

May 14, 2025
Via arXiv

Cognitive Science-Inspired Evaluation of Core Capabilities for Object Understanding in AI

Mar 27, 2025
Via arXiv

General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Mar 09, 2025
Via arXiv

Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture

Feb 21, 2025
Via arXiv

PredictaBoard: Benchmarking LLM Score Predictability

Feb 20, 2025
Via arXiv