Picture for Alex Gu

Alex Gu

Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

Add code
Jun 11, 2026
Viaarxiv icon

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Add code
Mar 25, 2026
Viaarxiv icon

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Add code
Mar 04, 2026
Viaarxiv icon

Completion $ eq$ Collaboration: Scaling Collaborative Effort with Agents

Add code
Oct 30, 2025
Viaarxiv icon

Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction

Add code
Aug 26, 2025
Viaarxiv icon

Solving Inequality Proofs with Large Language Models

Add code
Jun 09, 2025
Viaarxiv icon

Challenges and Paths Towards AI for Software Engineering

Add code
Mar 28, 2025
Viaarxiv icon

Mixture of Parrots: Experts improve memorization more than reasoning

Add code
Oct 24, 2024
Figure 1 for Mixture of Parrots: Experts improve memorization more than reasoning
Figure 2 for Mixture of Parrots: Experts improve memorization more than reasoning
Figure 3 for Mixture of Parrots: Experts improve memorization more than reasoning
Figure 4 for Mixture of Parrots: Experts improve memorization more than reasoning
Viaarxiv icon

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Add code
Jun 26, 2024
Figure 1 for BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Figure 2 for BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Figure 3 for BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Figure 4 for BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Viaarxiv icon

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Add code
Mar 12, 2024
Figure 1 for LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Figure 2 for LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Figure 3 for LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Figure 4 for LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Viaarxiv icon