Yilun Zhao

Z1: Efficient Test-time Scaling with Code

Apr 01, 2025
Via arXiv

MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search

Mar 26, 2025
Via arXiv

Survey on Evaluation of LLM-based Agents

Mar 20, 2025
Via arXiv

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Mar 10, 2025
Via arXiv

IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval

Mar 06, 2025
Via arXiv

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Jan 21, 2025
Via arXiv

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Jan 11, 2025
Via arXiv

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Dec 30, 2024
Via arXiv

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain

Nov 23, 2024
Via arXiv

FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents

Nov 08, 2024
Via arXiv