Michael Pokorny
Abstract:Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This dataset - created without any LM assistance - is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all 203 base questions with five answer options each have had the decision-irrelevant demographic patient information removed and replaced with variables (e.g., AGE), and are available for male, female, or non-binary-coded patients. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating eleven off-the-shelf and four mental health fine-tuned LMs on category-specific task accuracy, on the impact of patient demographic information on decision-making, and how consistently free-form responses deviate from human annotated samples.
Abstract:The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes (miscoordination, conflict, and collusion) based on agents' incentives, as well as seven key risk factors (information asymmetries, network effects, selection pressures, destabilising dynamics, commitment problems, emergent agency, and multi-agent security) that can underpin them. We highlight several important instances of each risk, as well as promising directions to help mitigate them. By anchoring our analysis in a range of real-world examples and experimental evidence, we illustrate the distinct challenges posed by multi-agent systems and their implications for the safety, governance, and ethics of advanced AI.
Abstract:Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Abstract:AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be easily replicated. To support benchmark developers in aligning with best practices, we provide a checklist for minimum quality assurance based on our assessment. We also develop a living repository of benchmark assessments to support benchmark comparability, accessible at betterbench.stanford.edu.
Abstract:There is an increasing interest in using language models (LMs) for automated decision-making, with multiple countries actively testing LMs to aid in military crisis decision-making. To scrutinize relying on LM decision-making in high-stakes settings, we examine the inconsistency of responses in a crisis simulation ("wargame"), similar to reported tests conducted by the US military. Prior work illustrated escalatory tendencies and varying levels of aggression among LMs but were constrained to simulations with pre-defined actions. This was due to the challenges associated with quantitatively measuring semantic differences and evaluating natural language decision-making without relying on pre-defined actions. In this work, we query LMs for free form responses and use a metric based on BERTScore to measure response inconsistency quantitatively. Leveraging the benefits of BERTScore, we show that the inconsistency metric is robust to linguistic variations that preserve semantic meaning in a question-answering setting across text lengths. We show that all five tested LMs exhibit levels of inconsistency that indicate semantic differences, even when adjusting the wargame setting, anonymizing involved conflict countries, or adjusting the sampling temperature parameter $T$. Further qualitative evaluation shows that models recommend courses of action that share few to no similarities. We also study the impact of different prompt sensitivity variations on inconsistency at temperature $T = 0$. We find that inconsistency due to semantically equivalent prompt variations can exceed response inconsistency from temperature sampling for most studied models across different levels of ablations. Given the high-stakes nature of military deployment, we recommend further consideration be taken before using LMs to inform military decisions or other cases of high-stakes decision-making.
Abstract:Chain-of-Thought (CoT) reasoning could in principle enable a deeper understanding of a language model's (LM) internal reasoning. However, prior work suggests that some LMs answer questions similarly despite changes in their CoT, suggesting that those models are not truly using the CoT. We propose a training method to produce CoTs that are sufficient alone for predicting future text, independent of other context. This methodology gives a guarantee that if the LM can predict future tokens, then it must have used the CoT to understand its context. We formalize the idea that the truthfulness of a sender to a receiver LM is the degree to which the sender helps the receiver predict their future observations. Then we define a "Markovian" LM as one which predicts future text given only a CoT as context. We derive a "Markovian training" procedure by applying our definition of truthfulness to a Markovian LM and optimizing via policy gradient and Proximal Policy Optimization (PPO). We demonstrate the effectiveness of our training algorithm on long-context arithmetic problems, show that the model utilizes the CoT, and validate that the generated CoT is meaningful and usable by other models.
Abstract:Wargames have a long history in the development of military strategy and the response of nations to threats or attacks. The advent of artificial intelligence (AI) promises better decision-making and increased military effectiveness. However, there is still debate about how AI systems, especially large language models (LLMs), behave as compared to humans. To this end, we use a wargame experiment with 107 national security expert human players designed to look at crisis escalation in a fictional US-China scenario and compare human players to LLM-simulated responses. We find considerable agreement in the LLM and human responses but also significant quantitative and qualitative differences between simulated and human players in the wargame, motivating caution to policymakers before handing over autonomy or following AI-based strategy recommendations.
Abstract:Governments are increasingly considering integrating autonomous AI agents in high-stakes military and foreign-policy decision-making, especially with the emergence of advanced generative AI models like GPT-4. Our work aims to scrutinize the behavior of multiple AI agents in simulated wargames, specifically focusing on their predilection to take escalatory actions that may exacerbate multilateral conflicts. Drawing on political science and international relations literature about escalation dynamics, we design a novel wargame simulation and scoring framework to assess the escalation risks of actions taken by these agents in different scenarios. Contrary to prior studies, our research provides both qualitative and quantitative insights and focuses on large language models (LLMs). We find that all five studied off-the-shelf LLMs show forms of escalation and difficult-to-predict escalation patterns. We observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons. Qualitatively, we also collect the models' reported reasonings for chosen actions and observe worrying justifications based on deterrence and first-strike tactics. Given the high stakes of military and foreign-policy contexts, we recommend further examination and cautious consideration before deploying autonomous language model agents for strategic military or diplomatic decision-making.
Abstract:We describe a software package, TomOpt, developed to optimise the geometrical layout and specifications of detectors designed for tomography by scattering of cosmic-ray muons. The software exploits differentiable programming for the modeling of muon interactions with detectors and scanned volumes, the inference of volume properties, and the optimisation cycle performing the loss minimisation. In doing so, we provide the first demonstration of end-to-end-differentiable and inference-aware optimisation of particle physics instruments. We study the performance of the software on a relevant benchmark scenarios and discuss its potential applications.
Abstract:Recent advancements in interpretability research made transformer language models more transparent. This progress led to a better understanding of their inner workings for toy and naturally occurring models. However, how these models internally process sentiment changes has yet to be sufficiently answered. In this work, we introduce a new interpretability tool called PCP ablation, where we replace modules with low-rank matrices based on the principal components of their activations, reducing model parameters and their behavior to essentials. We demonstrate PCP ablations on MLP and attention layers in backdoored toy, backdoored large, and naturally occurring models. We determine MLPs as most important for the backdoor mechanism and use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements via PCP ablation.