Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Joshua Batson

When Models Manipulate Manifolds: The Geometry of a Counting Task

Jan 08, 2026

Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, Joshua Batson

Abstract:Language models can perceive visual properties of text despite receiving only sequences of tokens-we mechanistically investigate how Claude 3.5 Haiku accomplishes one such task: linebreaking in fixed-width text. We find that character counts are represented on low-dimensional curved manifolds discretized by sparse feature families, analogous to biological place cells. Accurate predictions emerge from a sequence of geometric transformations: token lengths are accumulated into character count manifolds, attention heads twist these manifolds to estimate distance to the line boundary, and the decision to break the line is enabled by arranging estimates orthogonally to create a linear decision boundary. We validate our findings through causal interventions and discover visual illusions--character sequences that hijack the counting mechanism. Our work demonstrates the rich sensory processing of early layers, the intricacy of attention algorithms, and the importance of combining feature-based and geometric views of interpretability.

Via

Access Paper or Ask Questions

Auditing language models for hidden objectives

Mar 14, 2025

Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax(+25 more)

Figure 1 for Auditing language models for hidden objectives

Figure 2 for Auditing language models for hidden objectives

Figure 3 for Auditing language models for hidden objectives

Figure 4 for Auditing language models for hidden objectives

Abstract:We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.

Via

Access Paper or Ask Questions

Open Problems in Mechanistic Interpretability

Jan 27, 2025

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom(+19 more)

Figure 1 for Open Problems in Mechanistic Interpretability

Figure 2 for Open Problems in Mechanistic Interpretability

Figure 3 for Open Problems in Mechanistic Interpretability

Figure 4 for Open Problems in Mechanistic Interpretability

Abstract:Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

Via

Access Paper or Ask Questions

Scaling Laws in Jet Classification

Dec 04, 2023

Joshua Batson, Yonatan Kahn

Abstract:We demonstrate the emergence of scaling laws in the benchmark top versus QCD jet classification problem in collider physics. Six distinct physically-motivated classifiers exhibit power-law scaling of the binary cross-entropy test loss as a function of training set size, with distinct power law indices. This result highlights the importance of comparing classifiers as a function of dataset size rather than for a fixed training set, as the optimal classifier may change considerably as the dataset is scaled up. We speculate on the interpretation of our results in terms of previous models of scaling laws observed in natural language and image datasets.

* 10+2 pages, 7 figures, 1 table

Via

Access Paper or Ask Questions

Topological Obstructions to Autoencoding

Feb 16, 2021

Joshua Batson, C. Grace Haaf, Yonatan Kahn, Daniel A. Roberts

Figure 1 for Topological Obstructions to Autoencoding

Figure 2 for Topological Obstructions to Autoencoding

Figure 3 for Topological Obstructions to Autoencoding

Figure 4 for Topological Obstructions to Autoencoding

Abstract:Autoencoders have been proposed as a powerful tool for model-independent anomaly detection in high-energy physics. The operating principle is that events which do not belong to the space of training data will be reconstructed poorly, thus flagging them as anomalies. We point out that in a variety of examples of interest, the connection between large reconstruction error and anomalies is not so clear. In particular, for data sets with nontrivial topology, there will always be points that erroneously seem anomalous due to global issues. Conversely, neural networks typically have an inductive bias or prior to locally interpolate such that undersampled or rare events may be reconstructed with small error, despite actually being the desired anomalies. Taken together, these facts are in tension with the simple picture of the autoencoder as an anomaly detector. Using a series of illustrative low-dimensional examples, we show explicitly how the intrinsic and extrinsic topology of the dataset affects the behavior of an autoencoder and how this topology is manifested in the latent space representation during training. We ground this analysis in the discussion of a mock "bump hunt" in which the autoencoder fails to identify an anomalous "signal" for reasons tied to the intrinsic topology of $n$-particle phase space.

* 24 + 20 pages, 26 figures; no autoencoders were harmed in the making of this project

Via

Access Paper or Ask Questions

Patch2Self: Denoising Diffusion MRI with Self-Supervised Learning

Nov 02, 2020

Shreyas Fadnavis, Joshua Batson, Eleftherios Garyfallidis

Figure 1 for Patch2Self: Denoising Diffusion MRI with Self-Supervised Learning

Figure 2 for Patch2Self: Denoising Diffusion MRI with Self-Supervised Learning

Figure 3 for Patch2Self: Denoising Diffusion MRI with Self-Supervised Learning

Figure 4 for Patch2Self: Denoising Diffusion MRI with Self-Supervised Learning

Abstract:Diffusion-weighted magnetic resonance imaging (DWI) is the only noninvasive method for quantifying microstructure and reconstructing white-matter pathways in the living human brain. Fluctuations from multiple sources create significant additive noise in DWI data which must be suppressed before subsequent microstructure analysis. We introduce a self-supervised learning method for denoising DWI data, Patch2Self, which uses the entire volume to learn a full-rank locally linear denoiser for that volume. By taking advantage of the oversampled q-space of DWI data, Patch2Self can separate structure from noise without requiring an explicit model for either. We demonstrate the effectiveness of Patch2Self via quantitative and qualitative improvements in microstructure modeling, tracking (via fiber bundle coherency) and model estimation relative to other unsupervised methods on real and simulated data.

* Thirty-fourth Conference on Neural Information Processing Systems, 2020

Via

Access Paper or Ask Questions

Image Deconvolution via Noise-Tolerant Self-Supervised Inversion

Jun 11, 2020

Hirofumi Kobayashi, Ahmet Can Solak, Joshua Batson, Loic A. Royer

Figure 1 for Image Deconvolution via Noise-Tolerant Self-Supervised Inversion

Figure 2 for Image Deconvolution via Noise-Tolerant Self-Supervised Inversion

Figure 3 for Image Deconvolution via Noise-Tolerant Self-Supervised Inversion

Figure 4 for Image Deconvolution via Noise-Tolerant Self-Supervised Inversion

Abstract:We propose a general framework for solving inverse problems in the presence of noise that requires no signal prior, no noise estimate, and no clean training data. We only require that the forward model be available and that the noise be statistically independent across measurement dimensions. We build upon the theory of $\mathcal{J}$-invariant functions (Batson & Royer 2019, arXiv:1901.11365) and show how self-supervised denoising \emph{\`a la} Noise2Self is a special case of learning a noise-tolerant pseudo-inverse of the identity. We demonstrate our approach by showing how a convolutional neural network can be taught in a self-supervised manner to deconvolve images and surpass in image quality classical inversion schemes such as Lucy-Richardson deconvolution.

Via

Access Paper or Ask Questions

Noise2Self: Blind Denoising by Self-Supervision

Jan 30, 2019

Joshua Batson, Loic Royer

Figure 1 for Noise2Self: Blind Denoising by Self-Supervision

Figure 2 for Noise2Self: Blind Denoising by Self-Supervision

Figure 3 for Noise2Self: Blind Denoising by Self-Supervision

Figure 4 for Noise2Self: Blind Denoising by Self-Supervision

Abstract:We propose a general framework for denoising high-dimensional measurements which requires no prior on the signal, no estimate of the noise, and no clean training data. The only assumption is that the noise exhibits statistical independence across different dimensions of the measurement. Moreover, our framework is not restricted to a particular denoising model. We show how it can be used to calibrate any parameterised denoising algorithm, from the single hyperparameter of a median filter to the millions of weights of a deep neural network. We demonstrate this on natural image and microscopy data, where we exploit noise independence between pixels, and on single-cell gene expression data, where we exploit independence between detections of individual molecules. Finally, we prove a theoretical lower bound on the performance of an optimal denoiser. This framework generalizes recent work on training neural nets from noisy images and on cross-validation for matrix factorization.

* 10 pages, 6 figures, and supplement

Via

Access Paper or Ask Questions