Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Florian Dietz

Auditing language models for hidden objectives

Mar 14, 2025

Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax(+25 more)

Abstract:We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.

Via

Access Paper or Ask Questions

IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently

Jan 01, 2025

Florian Dietz, Dietrich Klakow

Figure 1 for IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently

Figure 2 for IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently

Figure 3 for IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently

Figure 4 for IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently

Abstract:Solving arithmetic tasks is a simple and fundamental skill, yet modern Large Language Models (LLMs) have great difficulty with them. We introduce the Integrated Gated Calculator (IGC), a module that enables LLMs to perform arithmetic by emulating a calculator on the GPU. We finetune a Llama model with our module and test it on the BigBench Arithmetic benchmark, where it beats the State of the Art, outperforming all models on the benchmark, including models almost two orders of magnitude larger. Our approach takes only a single iteration to run and requires no external tools. It performs arithmetic operations entirely inside the LLM without the need to produce intermediate tokens. It is computationally efficient, interpretable, and avoids side-effects on tasks that do not require arithmetic operations. It reliably achieves 98\% to 99\% accuracy across multiple training runs and for all subtasks, including the substantially harder subtask of multiplication, which was previously unsolved.

Via

Access Paper or Ask Questions

Block-Operations: Using Modular Routing to Improve Compositional Generalization

Aug 01, 2024

Florian Dietz, Dietrich Klakow

Abstract:We explore the hypothesis that poor compositional generalization in neural networks is caused by difficulties with learning effective routing. To solve this problem, we propose the concept of block-operations, which is based on splitting all activation tensors in the network into uniformly sized blocks and using an inductive bias to encourage modular routing and modification of these blocks. Based on this concept we introduce the Multiplexer, a new architectural component that enhances the Feed Forward Neural Network (FNN). We experimentally confirm that Multiplexers exhibit strong compositional generalization. On both a synthetic and a realistic task our model was able to learn the underlying process behind the task, whereas both FNNs and Transformers were only able to learn heuristic approximations. We propose as future work to use the principles of block-operations to improve other existing architectures.

Via

Access Paper or Ask Questions

Comgra: A Tool for Analyzing and Debugging Neural Networks

Jul 31, 2024

Florian Dietz, Sophie Fellenz, Dietrich Klakow, Marius Kloft

Figure 1 for Comgra: A Tool for Analyzing and Debugging Neural Networks

Abstract:Neural Networks are notoriously difficult to inspect. We introduce comgra, an open source python library for use with PyTorch. Comgra extracts data about the internal activations of a model and organizes it in a GUI (graphical user interface). It can show both summary statistics and individual data points, compare early and late stages of training, focus on individual samples of interest, and visualize the flow of the gradient through the network. This makes it possible to inspect the model's behavior from many different angles and save time by rapidly testing different hypotheses without having to rerun it. Comgra has applications for debugging, neural architecture design, and mechanistic interpretability. We publish our library through Python Package Index (PyPI) and provide code, documentation, and tutorials at https://github.com/FlorianDietz/comgra.

Via

Access Paper or Ask Questions

Enhancing Reinforcement Learning with discrete interfaces to learn the Dyck Language

Oct 27, 2021

Florian Dietz, Dietrich Klakow

Figure 1 for Enhancing Reinforcement Learning with discrete interfaces to learn the Dyck Language

Figure 2 for Enhancing Reinforcement Learning with discrete interfaces to learn the Dyck Language

Abstract:Even though most interfaces in the real world are discrete, no efficient way exists to train neural networks to make use of them, yet. We enhance an Interaction Network (a Reinforcement Learning architecture) with discrete interfaces and train it on the generalized Dyck language. This task requires an understanding of hierarchical structures to solve, and has long proven difficult for neural networks. We provide the first solution based on learning to use discrete data structures. We encountered unexpected anomalous behavior during training, and utilized pre-training based on execution traces to overcome them. The resulting model is very small and fast, and generalizes to sequences that are an entire order of magnitude longer than the training data.

Via

Access Paper or Ask Questions

Interaction Networks: Using a Reinforcement Learner to train other Machine Learning algorithms

Jun 15, 2020

Florian Dietz

Figure 1 for Interaction Networks: Using a Reinforcement Learner to train other Machine Learning algorithms

Figure 2 for Interaction Networks: Using a Reinforcement Learner to train other Machine Learning algorithms

Figure 3 for Interaction Networks: Using a Reinforcement Learner to train other Machine Learning algorithms

Figure 4 for Interaction Networks: Using a Reinforcement Learner to train other Machine Learning algorithms

Abstract:The wiring of neurons in the brain is more flexible than the wiring of connections in contemporary artificial neural networks. It is possible that this extra flexibility is important for efficient problem solving and learning. This paper introduces the Interaction Network. Interaction Networks aim to capture some of this extra flexibility. An Interaction Network consists of a collection of conventional neural networks, a set of memory locations, and a DQN or other reinforcement learner. The DQN decides when each of the neural networks is executed, and on what memory locations. In this way, the individual neural networks can be trained on different data, for different tasks. At the same time, the results of the individual networks influence the decision process of the reinforcement learner. This results in a feedback loop that allows the DQN to perform actions that improve its own decision-making. Any existing type of neural network can be reproduced in an Interaction Network in its entirety, with only a constant computational overhead. Interaction Networks can then introduce additional features to improve performance further. These make the algorithm more flexible and general, but at the expense of being harder to train. In this paper, thought experiments are used to explore how the additional abilities of Interaction Networks could be used to improve various existing types of neural networks. Several experiments have been run to prove that the concept is sound. These show that the basic idea works, but they also reveal a number of challenges that do not appear in conventional neural networks, which make Interaction Networks very hard to train. Further research needs to be done to alleviate these issues. A number of promising avenues of research to achieve this are outlined in this paper.

Via

Access Paper or Ask Questions