
Dieuwke Hupkes

Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?

Nov 06, 2024

The Llama 3 Herd of Models

Jul 31, 2024

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Jun 18, 2024

Quantifying Variance in Evaluation Benchmarks

Jun 14, 2024

Interpretability of Language Models via Task Spaces

Jun 10, 2024

From Form to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

Apr 18, 2024

The ICL Consistency Test

Dec 08, 2023

WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

Nov 27, 2023

Memorisation Cartography: Mapping out the Memorisation-Generalisation Continuum in Neural Machine Translation

Nov 09, 2023

The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks

Oct 26, 2023