William Yang Wang

AgentOrca: A Dual-System Framework to Evaluate Language Agents on Operational Routine and Constraint Adherence

Mar 11, 2025
Via arXiv

InductionBench: LLMs Fail in the Simplest Complexity Class

Feb 26, 2025
Via arXiv

MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Feb 20, 2025
Via arXiv

MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison

Feb 07, 2025
Via arXiv

BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

Jan 13, 2025
Via arXiv

Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework

Dec 22, 2024
Via arXiv

AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge

Dec 18, 2024
Via arXiv

Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning

Dec 15, 2024
Via arXiv

RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Dec 12, 2024
Via arXiv

Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students

Nov 27, 2024
Via arXiv