
Bill Yuchen Lin

Shammie

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Nov 26, 2024

Stronger Models are NOT Stronger Teachers for Instruction Tuning

Nov 12, 2024

On Memorization of Large Language Models in Logical Reasoning

Oct 30, 2024

Latent Action Pretraining from Videos

Oct 15, 2024

CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

Oct 03, 2024

Visual Perception in Text Strings

Oct 02, 2024

HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

Sep 26, 2024

SimulBench: Evaluating Language Models with Creative Simulation Tasks

Sep 11, 2024

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

Jul 26, 2024

The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

Jul 15, 2024