Danny Halawi

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

Sep 30, 2024

Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock

Abstract: Forecasts of future events are essential inputs into informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a standardized set of forecasting questions. To address this gap, we introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML systems on an automatically generated and regularly updated set of 1,000 forecasting questions. To avoid any possibility of data leakage, ForecastBench is comprised solely of questions about future events that have no known answer at the time of submission. We quantify the ability of current ML systems by collecting forecasts from expert (human) forecasters, the general public, and LLMs on a random subset of questions from the benchmark (N = 200). While LLMs have achieved super-human performance on many benchmarks, they perform less well here: expert forecasters outperform the top-performing LLM (p-values <= 0.01). We display system and human scores in a public leaderboard at www.forecastbench.org.

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Jun 28, 2024

Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt

Abstract: Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

* 22 pages

Dominion: A New Frontier for AI Research

May 10, 2024

Danny Halawi, Aron Sarmasi, Siena Saltzen, Joshua McCoy

Abstract: In recent years, machine learning approaches have made dramatic advances, reaching superhuman performance in Go, Atari, and poker variants. These games, and others before them, have served not only as a testbed but have also helped to push the boundaries of AI research. Continuing this tradition, we examine the tabletop game Dominion and discuss the properties that make it well-suited to serve as a benchmark for the next generation of reinforcement learning (RL) algorithms. We also present the Dominion Online Dataset, a collection of over 2,000,000 games of Dominion played by experienced players on the Dominion Online webserver. Finally, we introduce an RL baseline bot that uses existing techniques to beat common heuristic-based bots, and shows competitive performance against the previously strongest bot, Provincial.

Approaching Human-Level Forecasting with Language Models

Feb 28, 2024

Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt

Abstract: Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.
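
The abstract mentions aggregating predictions and comparing them with human aggregates, but does not say how. The following is a minimal sketch under stated assumptions: the trimmed-mean aggregator and the Brier score are common choices in forecasting, picked here for illustration only, and the function names are hypothetical rather than taken from the paper.

```python
from statistics import mean


def trimmed_mean(probs, trim=1):
    """Aggregate probability forecasts (assumed scheme, not the paper's).

    Drops the `trim` lowest and `trim` highest forecasts, then averages
    the rest. Falls back to a plain mean when too few forecasts remain.
    """
    if not probs:
        raise ValueError("need at least one forecast")
    ordered = sorted(probs)
    if len(ordered) > 2 * trim:
        ordered = ordered[trim:len(ordered) - trim]
    return mean(ordered)


def brier_score(prob, outcome):
    """Squared error between a forecast probability and a 0/1 outcome.

    Lower is better; an uninformative forecast of 0.5 always scores 0.25.
    """
    return (prob - outcome) ** 2


# Toy forecasts for one binary question (illustrative values only).
forecasts = [0.2, 0.6, 0.7, 0.7, 0.95]
aggregate = trimmed_mean(forecasts)
print(aggregate, brier_score(aggregate, 1))
```

Trimming limits the influence of a single extreme forecast on the aggregate, which is one reason it is often preferred over a plain mean.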

Overthinking the Truth: Understanding how Language Models Process False Demonstrations

Jul 18, 2023

Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt

Abstract: Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: overthinking and false induction heads. The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
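
To illustrate the idea of a layer at which behavior diverges, the sketch below takes per-layer accuracies under correct and incorrect demonstrations and returns the first layer where the gap exceeds a threshold. The threshold rule and the function name are assumptions made for illustration; the abstract does not give the paper's exact definition of the critical layer.

```python
def first_divergence_layer(acc_correct, acc_incorrect, threshold=0.1):
    """Return the first layer index where the accuracy gap exceeds `threshold`.

    acc_correct[i] / acc_incorrect[i] are accuracies obtained by decoding
    predictions from layer i under correct / incorrect demonstrations.
    Returns None if the gap never exceeds the threshold.
    (Hypothetical criterion, not the paper's definition.)
    """
    if len(acc_correct) != len(acc_incorrect):
        raise ValueError("accuracy lists must cover the same layers")
    for layer, (good, bad) in enumerate(zip(acc_correct, acc_incorrect)):
        if good - bad > threshold:
            return layer
    return None


# Made-up accuracies for a four-layer toy model, not results from the paper.
print(first_divergence_layer([0.5, 0.6, 0.7, 0.8], [0.5, 0.6, 0.5, 0.4]))
```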

Eliciting Latent Predictions from Transformers with the Tuned Lens

Mar 15, 2023

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt

Abstract: We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier "logit lens" technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. We also find the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.
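
The decoding step described in the abstract can be sketched roughly as follows: apply a per-layer affine map to a hidden state, project through the unembedding matrix, and take a softmax. This is a simplified pure-Python illustration, not the API of the linked repository. It assumes toy dimensions, omits the final layer normalization, and does not show how the probe is trained; all names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_hidden_state(hidden, probe_weight, probe_bias, unembed):
    """Map a hidden state to a distribution over the vocabulary.

    probe_weight, probe_bias: the affine probe for this layer (assumed
    to be already trained). unembed: one row per vocabulary token.
    """
    translated = [y + b for y, b in zip(matvec(probe_weight, hidden), probe_bias)]
    return softmax(matvec(unembed, translated))


# Toy example: hidden size 2, vocabulary size 3 (illustrative values only).
hidden = [1.0, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(decode_hidden_state(hidden, identity, [0.0, 0.0], unembed))
```

With an identity probe and zero bias, the function reduces to direct unembedding of the hidden state, which corresponds to the logit lens in this simplified setting.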
