Xuezhi Cao

Alphabetical order by last name

CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions

Oct 30, 2025
Via arXiv

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

Oct 30, 2025
Via arXiv

Making Mathematical Reasoning Adaptive

Oct 06, 2025
Via arXiv

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

Sep 18, 2025
Via arXiv

Instance-level Randomization: Toward More Stable LLM Evaluations

Sep 16, 2025
Via arXiv

HKD4VLM: A Progressive Hybrid Knowledge Distillation Framework for Robust Multimodal Hallucination and Factuality Detection in VLMs

Jun 16, 2025
Via arXiv

OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics

Jun 12, 2025
Via arXiv

NTIRE 2025 Challenge on Text to Image Generation Model Quality Assessment

May 22, 2025
Via arXiv

ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

May 20, 2025
Via arXiv

Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement

May 17, 2025
Via arXiv