Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zainab Ali Majid

MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Apr 18, 2025

Jaime Raldua Veuthey, Zainab Ali Majid, Suhas Hariharan, Jacob Haimes

Figure 1 for MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Figure 2 for MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Figure 3 for MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Figure 4 for MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Abstract:As Large Language Models (LLMs) advance, their potential for widespread societal impact grows simultaneously. Hence, rigorous LLM evaluations are both a technical necessity and social imperative. While numerous evaluation benchmarks have been developed, there remains a critical gap in meta-evaluation: effectively assessing benchmarks' quality. We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks, to provide standardized assessments, quantifiable scores, and enable meaningful intra-benchmark comparisons. We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators, highlighting the benchmarks' strengths and weaknesses. We motivate our choice of test domain by AI models' dual nature as powerful defensive tools and security threats.

Via

Access Paper or Ask Questions

Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique

Nov 13, 2024

Suhas Hariharan, Zainab Ali Majid, Jaime Raldua Veuthey, Jacob Haimes

Figure 1 for Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique

Abstract:A key development in the cybersecurity evaluations space is the work carried out by Meta, through their CyberSecEval approach. While this work is undoubtedly a useful contribution to a nascent field, there are notable features that limit its utility. Key drawbacks focus on the insecure code detection part of Meta's methodology. We explore these limitations, and use our exploration as a test case for LLM-assisted benchmark analysis.

* NeurIPS 2024, 2 pages

Via

Access Paper or Ask Questions

Extracting Paragraphs from LLM Token Activations

Sep 10, 2024

Nicholas Pochinkov, Angelo Benoit, Lovkush Agarwal, Zainab Ali Majid, Lucile Ter-Minassian

Figure 1 for Extracting Paragraphs from LLM Token Activations

Figure 2 for Extracting Paragraphs from LLM Token Activations

Figure 3 for Extracting Paragraphs from LLM Token Activations

Figure 4 for Extracting Paragraphs from LLM Token Activations

Abstract:Generative large language models (LLMs) excel in natural language processing tasks, yet their inner workings remain underexplored beyond token-level predictions. This study investigates the degree to which these models decide the content of a paragraph at its onset, shedding light on their contextual understanding. By examining the information encoded in single-token activations, specifically the "\textbackslash n\textbackslash n" double newline token, we demonstrate that patching these activations can transfer significant information about the context of the following paragraph, providing further insights into the model's capacity to plan ahead.

Via

Access Paper or Ask Questions