Jordan Lee Boyd-Graber

Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators

Mar 09, 2025
Via arXiv

GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

Feb 27, 2025
Via arXiv

Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Feb 19, 2025
Via arXiv

Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL

Feb 18, 2025
Via arXiv

Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas

Jan 20, 2025
Via arXiv

Personalized Help for Optimizing Low-Skilled Users' Strategy

Nov 14, 2024
Via arXiv

ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

Jun 24, 2024
Via arXiv

AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

Jun 16, 2024
Via arXiv

More Victories, Less Cooperation: Assessing Cicero's Diplomacy Play

Jun 07, 2024
Via arXiv

PANDA (Pedantic ANswer-correctness Determination and Adjudication): Improving Automatic Evaluation for Question Answering and Text Generation

Feb 17, 2024
Via arXiv