Samuel Albanie

Michael Pokorny

A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

Apr 09, 2025

An Approach to Technical AGI Safety and Security

Apr 02, 2025

Humanity's Last Exam

Jan 24, 2025

Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games

Dec 18, 2024

ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Dec 09, 2024

How to Merge Your Multimodal Models Over Time?

Dec 09, 2024

Active Data Curation Effectively Distills Large-Scale Multimodal Models

Nov 27, 2024

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

Nov 07, 2024

A Practitioner's Guide to Continual Multimodal Pretraining

Aug 26, 2024

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Aug 21, 2024