Title: Engineering Monosemanticity in Toy Models

Nov 16, 2022

Adam S. Jermyn, Nicholas Schiefer, Evan Hubinger

Abstract: In some neural networks, individual neurons correspond to natural "features" in the input. Such monosemantic neurons are of great help in interpretability studies, as they can be cleanly understood. In this work we report preliminary attempts to engineer monosemanticity in toy models. We find that models can be made more monosemantic without increasing the loss by just changing which local minimum the training process finds. More monosemantic loss minima have moderate negative biases, and we are able to use this fact to engineer highly monosemantic models. We are able to mechanistically interpret these models, including the residual polysemantic neurons, and uncover a simple yet surprising algorithm. Finally, we find that providing models with more neurons per layer makes the models more monosemantic, albeit at increased computational cost. These findings point to a number of new questions and avenues for engineering monosemanticity, which we intend to study in future work.

* 31 pages, 26 figures
