Hu Wei

Architectural Design Decisions in AI Agent Harnesses

Apr 20, 2026
Via arXiv

From Agent Loops to Structured Graphs: A Scheduler-Theoretic Framework for LLM Agent Execution

Apr 13, 2026
Via arXiv

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

Apr 04, 2026
Via arXiv

IndustryCode: A Benchmark for Industry Code Generation

Apr 03, 2026
Via arXiv

Logics-Parsing-Omni Technical Report

Mar 12, 2026
Via arXiv

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Mar 04, 2026
Via arXiv

ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Mar 03, 2026
Via arXiv

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

Feb 26, 2026
Via arXiv

DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Feb 18, 2026
Via arXiv

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Feb 17, 2026
Via arXiv