Summer Yue

EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

Feb 13, 2025
Via arXiv

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

Jan 29, 2025
Via arXiv

Humanity's Last Exam

Jan 24, 2025
Via arXiv

Planning In Natural Language Improves LLM Search For Code Generation

Sep 05, 2024
Via arXiv

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Aug 27, 2024
Via arXiv

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

May 02, 2024
Via arXiv

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Mar 06, 2024
Via arXiv

Gemini: A Family of Highly Capable Multimodal Models

Dec 19, 2023
Via arXiv

RL-DARTS: Differentiable Architecture Search for Reinforcement Learning

Jun 04, 2021
Via arXiv