Samuel Albanie

Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games

Dec 18, 2024

ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Dec 09, 2024

How to Merge Your Multimodal Models Over Time?

Dec 09, 2024

Active Data Curation Effectively Distills Large-Scale Multimodal Models

Nov 27, 2024

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

Nov 07, 2024

A Practitioner's Guide to Continual Multimodal Pretraining

Aug 26, 2024

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Aug 21, 2024

On scalable oversight with weak LLMs judging strong LLMs

Jul 05, 2024

HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits

Jun 05, 2024

Inverse Constitutional AI: Compressing Preferences into Principles

Jun 02, 2024