Zhongyuan Peng

CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation

Feb 02, 2026

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Jan 29, 2026

SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning

Jan 08, 2026

CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

Jul 08, 2025

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

May 05, 2025

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

Apr 21, 2025

Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?

Feb 26, 2025

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

Feb 23, 2025

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Feb 20, 2025

A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

Oct 17, 2024