Bill Yuchen Lin

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Nov 26, 2024

Stronger Models are NOT Stronger Teachers for Instruction Tuning
Nov 12, 2024

On Memorization of Large Language Models in Logical Reasoning
Oct 30, 2024

Latent Action Pretraining from Videos
Oct 15, 2024

CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs
Oct 03, 2024

Visual Perception in Text Strings
Oct 02, 2024

HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions
Sep 26, 2024

SimulBench: Evaluating Language Models with Creative Simulation Tasks
Sep 11, 2024

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
Jul 26, 2024

The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Jul 15, 2024