Shashwat Goel

Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Jul 03, 2025

Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

Abstract: Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice, but we show that this has changed. We consider generative evaluation via what we call answer matching: give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models, even small ones, achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.

* 34 pages, Code is available at https://github.com/nikhilchandak/answer-matching
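
The answer-matching procedure described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: `toy_matcher` is a hypothetical substring-based stand-in for the grading language model, and `candidate` is any callable that maps a question (shown without options) to a free-form response.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def toy_matcher(question: str, response: str, reference: str) -> bool:
    # Hypothetical stand-in for the grader model: a real matcher would ask a
    # language model whether `response` matches `reference` for `question`.
    return normalize(reference) in normalize(response)


def answer_matching_accuracy(
    items: Iterable[Tuple[str, str]],
    candidate: Callable[[str], str],
    matcher: Callable[[str, str, str], bool] = toy_matcher,
) -> float:
    """Fraction of (question, reference) pairs whose free-form response matches."""
    items = list(items)
    if not items:
        return 0.0
    correct = 0
    for question, reference in items:
        response = candidate(question)  # the candidate never sees answer options
        if matcher(question, response, reference):
            correct += 1
    return correct / len(items)
```

Swapping `toy_matcher` for a call to an actual grader model keeps the rest of the loop unchanged, which is the main point of the sketch.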

Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation

Feb 26, 2025

Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu

Abstract: There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability: creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for fewer than 9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs' ability to falsify incorrect solutions, a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.

* Technical Report
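
The abstract notes that counterexamples can be checked automatically by code execution. One way such a check might look is sketched below. All names here are hypothetical, and the solutions are plain Python callables rather than sandboxed submissions, which is a simplification rather than the benchmark's actual harness.

```python
from typing import Any, Callable, List


def is_valid_counterexample(
    candidate_input: Any,
    is_valid_input: Callable[[Any], bool],
    reference_solution: Callable[[Any], Any],
    incorrect_solution: Callable[[Any], Any],
) -> bool:
    """Return True if the input is legal and exposes the incorrect solution."""
    if not is_valid_input(candidate_input):
        return False  # inputs that violate the problem constraints do not count
    try:
        wrong_output = incorrect_solution(candidate_input)
    except Exception:
        return True  # assumption: a crash on a legal input counts as a failure
    return wrong_output != reference_solution(candidate_input)


# Hypothetical toy problem: return the maximum of a non-empty list of integers.
def valid_list(xs: Any) -> bool:
    return isinstance(xs, list) and len(xs) > 0 and all(isinstance(x, int) for x in xs)


def reference_max(xs: List[int]) -> int:
    return max(xs)


def buggy_max(xs: List[int]) -> int:
    best = 0  # bug: silently assumes at least one non-negative element
    for x in xs:
        if x > best:
            best = x
    return best
```

Here `[-3, -1]` is a counterexample for `buggy_max`, while `[1, 2]` is not, and the empty list is rejected because it violates the input constraints.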

Great Models Think Alike and this Undermines AI Oversight

Feb 06, 2025

Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping

Abstract: As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend: model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.

* 60 pages, 20 figures
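
To give a feel for what "similarity based on overlap in model mistakes" can mean, the sketch below computes a simple chance-adjusted agreement (in the style of Cohen's kappa) between the per-question correctness of two models. This is an illustrative simplification and is not claimed to be the paper's probabilistic metric; the function name and inputs are assumptions.

```python
from typing import Sequence


def chance_adjusted_agreement(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Kappa-style agreement between two models' right/wrong patterns.

    Returns 0 when the models agree no more often than two independent models
    with the same accuracies would, and 1 when their patterns are identical.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_a)
    a = [bool(x) for x in correct_a]
    b = [bool(x) for x in correct_b]
    acc_a = sum(a) / n
    acc_b = sum(b) / n
    # Observed rate at which both models are right together or wrong together.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected if errors were independent given the two accuracies.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if expected == 1.0:
        return 0.0  # degenerate case: both models always right or always wrong
    return (observed - expected) / (1 - expected)
```

Adjusting for chance matters because two highly accurate models agree on most questions simply by both being right, which says little about whether their mistakes coincide.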

A Cognac shot to forget bad memories: Corrective Unlearning in GNNs

Dec 01, 2024

Varshita Kolipaka, Akshit Sinha, Debangan Mishra, Sumit Kumar, Arvindh Arun, Shashwat Goel, Ponnurangam Kumaraguru

Abstract: Graph Neural Networks (GNNs) are increasingly being used for a variety of ML applications on graph data. As graph data does not follow the independently and identically distributed (i.i.d.) assumption, adversarial manipulations or incorrect data can propagate to other data points through message passing, deteriorating the model's performance. To allow model developers to remove the adverse effects of manipulated entities from a trained GNN, we study the recently formulated problem of Corrective Unlearning. We find that current graph unlearning methods fail to unlearn the effect of manipulations even when the whole manipulated set is known. We introduce a new graph unlearning method, Cognac, which can unlearn the effect of the manipulation set even when only 5% of it is identified. It recovers most of the performance of a strong oracle with fully corrected training data, even beating retraining from scratch without the deletion set while being 8x more efficient. We hope our work guides GNN developers in fixing harmful effects due to issues in real-world data post-training.

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Mar 06, 2024

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan (+44 more)

Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 4,157 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop CUT, a state-of-the-art unlearning method based on controlling model representations. CUT reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai

* See the project page at https://wmdp.ai

Corrective Machine Unlearning

Feb 21, 2024

Shashwat Goel, Ameya Prabhu, Philip Torr, Ponnurangam Kumaraguru, Amartya Sanyal

Abstract: Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects like vulnerability to backdoored samples, systematic biases, and in general, reduced accuracy on certain input domains. Often, not all manipulated training samples are known, and only a small, representative subset of the affected data is flagged. We formalize "Corrective Machine Unlearning" as the problem of mitigating the impact of data affected by unknown manipulations on a trained model, possibly knowing only a subset of impacted samples. We demonstrate that the problem of corrective unlearning has significantly different requirements from traditional privacy-oriented unlearning. We find most existing unlearning methods, including the gold-standard retraining-from-scratch, require most of the manipulated data to be identified for effective corrective unlearning. However, one approach, SSD, achieves limited success in unlearning adverse effects with just a small portion of the manipulated samples, showing the tractability of this setting. We hope our work spurs research towards developing better methods for corrective unlearning and offers practitioners a new strategy to handle data integrity challenges arising from web-scale training.

* 17 pages, 7 figures

Representation Engineering: A Top-Down Approach to AI Transparency

Oct 10, 2023

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski (+11 more)

Abstract: In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

* Code is available at https://github.com/andyzoujm/representation-engineering

Proportional Aggregation of Preferences for Sequential Decision Making

Jun 26, 2023

Nikhil Chandak, Shashwat Goel, Dominik Peters

Abstract: We study the problem of fair sequential decision making given voter preferences. In each round, a decision rule must choose a decision from a set of alternatives where each voter reports which of these alternatives they approve. Instead of going with the most popular choice in each round, we aim for proportional representation. We formalize this aim using axioms based on Proportional Justified Representation (PJR), which were proposed in the literature on multi-winner voting and were recently adapted to multi-issue decision making. The axioms require that if a group of $\alpha\%$ of the voters agrees in every round (i.e., approves a common alternative), then those voters must approve at least $\alpha\%$ of the decisions. A stronger version of the axioms requires that every group of $\alpha\%$ of the voters that agrees in a $\beta$ fraction of rounds must approve $\beta\cdot\alpha\%$ of the decisions. We show that three attractive voting rules satisfy axioms of this style. One of them (Sequential Phragmén) makes its decisions online, and the other two satisfy strengthened versions of the axioms but make decisions semi-online (Method of Equal Shares) or fully offline (Proportional Approval Voting). The first two are polynomial-time computable, and the latter is based on an NP-hard optimization, but it admits a polynomial-time local search algorithm that satisfies the same axiomatic properties. We present empirical results about the performance of these rules based on synthetic data and U.S. political elections. We also run experiments where votes are cast by preference models trained on user responses from the moral machine dataset about ethical dilemmas.

* 35 pages
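
The proportionality requirement in the abstract above can be illustrated with a small checker for one group of voters. This is a sketch under stated assumptions rather than the paper's formal definition: it assumes a group counts as represented in a round when at least one of its members approves the decision, and that a group of size k out of n voters is owed floor(k * T / n) of the T rounds. All function names are hypothetical.

```python
from typing import Hashable, List, Sequence, Set

# approvals[t][v] is the set of alternatives voter v approves in round t.
Approvals = List[List[Set[Hashable]]]


def group_agrees(approvals: Approvals, group: Sequence[int]) -> bool:
    """True if, in every round, all group members approve a common alternative."""
    if not group:
        return False
    return all(
        set.intersection(*(round_approvals[v] for v in group))
        for round_approvals in approvals
    )


def rounds_represented(approvals: Approvals, decisions: Sequence[Hashable],
                       group: Sequence[int]) -> int:
    """Rounds in which at least one group member approves the decision (assumed reading)."""
    return sum(
        1
        for round_approvals, decision in zip(approvals, decisions)
        if any(decision in round_approvals[v] for v in group)
    )


def group_is_satisfied(approvals: Approvals, decisions: Sequence[Hashable],
                       group: Sequence[int]) -> bool:
    """Check the proportionality requirement for a single group."""
    if not group_agrees(approvals, group):
        return True  # the requirement only binds groups that agree in every round
    n_voters = len(approvals[0])
    n_rounds = len(decisions)
    required = (len(group) * n_rounds) // n_voters  # integer floor of alpha * T
    return rounds_represented(approvals, decisions, group) >= required
```

For three voters over three rounds, where voters 0 and 1 always approve only `a` and voter 2 always approves only `b`, choosing the most popular alternative every round yields decisions `a, a, a` and leaves voter 2 unsatisfied, whereas `a, a, b` satisfies both groups.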

Low impact agency: review and discussion

Mar 06, 2023

Danilo Naiff, Shashwat Goel

Abstract: Powerful artificial intelligence poses an existential threat if the AI decides to drastically change the world in pursuit of its goals. The hope of low-impact artificial intelligence is to incentivize the AI to avoid such changes precisely because they cause a large impact on the world. In this work, we first review the concept of low-impact agency and previous proposals to approach the problem, and then propose future research directions on the topic, with the goal of ensuring that low-impactedness is useful in making AI safe.

* Work done as part of the SERIMATS 3.0 training program

Evaluating Inexact Unlearning Requires Revisiting Forgetting

Jan 17, 2022

Shashwat Goel, Ameya Prabhu, Ponnurangam Kumaraguru

Abstract: Existing works in inexact machine unlearning focus on achieving indistinguishability from models retrained after removing the deletion set. We argue that indistinguishability is unnecessary, infeasible to measure, and its practical relaxations can be insufficient. We redefine the goal of unlearning as forgetting all information specific to the deletion set while maintaining high utility and resource efficiency. Motivated by the practical application of removing mislabelled and biased data from models, we introduce a novel test to measure the degree of forgetting called Interclass Confusion (IC). It allows us to analyze two aspects of forgetting: (i) memorization and (ii) property generalization. Despite being a black-box test, IC can investigate whether information from the deletion set was erased until the early layers of the network. We empirically show that two simple unlearning methods, exact-unlearning and catastrophic-forgetting the final k layers of a network, scale well to large deletion sets unlike prior unlearning methods. k controls the forgetting-efficiency tradeoff at similar utility. Overall, we believe our formulation of unlearning and the IC test will guide the design of better unlearning algorithms.
