Title: Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach

Jul 18, 2024

Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha

Abstract: Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad hoc. Inspired by the notion of abstract interpretation from the program analysis literature, which aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We use these axioms to guide the mechanistic interpretability analysis of a Transformer-based model trained to solve the well-known 2-SAT problem. We are able to reverse engineer the algorithm learned by the model: the model first parses the input formulas and then evaluates their satisfiability by enumerating the possible valuations of the Boolean input variables. We also present evidence that the mechanistic interpretation of the analyzed model indeed satisfies the stated axioms.
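
The two-stage procedure named in the abstract (parse the formula, then enumerate valuations) can be illustrated with a minimal Python sketch. This is not the authors' code and makes no claim about how the Transformer computes internally; the signed-integer input format and the names `parse_clauses`, `literal_value`, and `is_satisfiable` are assumptions introduced here for illustration only.

```python
from itertools import product


def parse_clauses(text):
    """Parse whitespace-separated signed integers into 2-literal clauses.

    Assumed (hypothetical) format: "1 -2 2 3" means
    (x1 OR NOT x2) AND (x2 OR x3).
    """
    literals = [int(tok) for tok in text.split()]
    if len(literals) % 2 != 0 or 0 in literals:
        raise ValueError("expected an even number of nonzero literals")
    return [(literals[i], literals[i + 1]) for i in range(0, len(literals), 2)]


def literal_value(literal, valuation):
    """Evaluate one signed literal under a {variable: bool} valuation."""
    value = valuation[abs(literal)]
    return value if literal > 0 else not value


def is_satisfiable(clauses):
    """Try every valuation of the variables; return True if one satisfies all clauses."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        valuation = dict(zip(variables, values))
        if all(
            literal_value(a, valuation) or literal_value(b, valuation)
            for a, b in clauses
        ):
            return True
    return False
```

For example, `is_satisfiable(parse_clauses("1 -2 2 3"))` checks the formula (x1 OR NOT x2) AND (x2 OR x3). Enumeration takes time exponential in the number of variables, so the sketch is only practical for small formulas.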
