Abstract: What types of numeric representations emerge in Neural Networks (NNs)? To what degree do NNs induce abstract, mutable, slot-like numeric variables, and in what situations do these representations emerge? How do these representations change over learning, and how can we understand the neural implementations in ways that are unified across different NNs? In this work, we approach these questions by first training sequence-based neural systems with Next Token Prediction (NTP) objectives on numeric tasks. We then seek to understand the neural solutions through the lens of causal abstractions or symbolic algorithms. Using a combination of causal interventions and visualization methods, we find that artificial neural models do indeed develop analogs of interchangeable, mutable, latent number variables purely from the NTP objective. We then ask how variations on the tasks and model architectures affect the models' learned solutions, finding that these symbol-like numeric representations do not form for every variant of the task, and that transformers solve the problem in a notably different way than their recurrent counterparts. We then show how the symbol-like variables change over the course of training, finding a strong correlation between the models' task performance and the alignment of their symbol-like representations. Lastly, we show that in all cases, some degree of gradience exists in these neural symbols, highlighting the difficulty of finding simple, interpretable symbolic stories of how neural networks perform numeric tasks. Taken together, our results are consistent with the view that neural networks can approximate interpretable symbolic programs of number cognition, but the particular program they approximate, and the extent to which they approximate it, can vary widely depending on the network architecture, training data, extent of training, and network size.
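To make the style of causal intervention concrete, the sketch below patches the hidden state from one run of a toy recurrent counter into another run, the kind of test that would reveal an interchangeable number variable. The model, vocabulary, timestep indices, and dimensions are hypothetical stand-ins, not the trained systems studied here.

```python
# Minimal sketch (assumed toy setup, not the paper's code) of a hidden-state
# swap intervention on a recurrent counting model trained with next-token
# prediction. If the network carries a symbol-like count, patching the state
# from a 7-item run into a 3-item run should make it respond as if it saw 7.
import torch
import torch.nn as nn

VOCAB = {"<bos>": 0, "item": 1, "respond": 2, "out": 3, "<eos>": 4}

class TinyCounter(nn.Module):
    """A GRU assumed to be trained (elsewhere) with NTP on a counting task."""
    def __init__(self, vocab_size=5, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def step(self, token_id, h):
        h = self.gru(self.embed(token_id), h)
        return self.head(h), h

def run(model, tokens, patch_at=None, patch_h=None):
    """Roll the model over a token sequence, optionally overwriting the
    hidden state at one timestep with a state taken from another run."""
    h = torch.zeros(1, 32)
    states = []
    for t, tok in enumerate(tokens):
        if patch_at is not None and t == patch_at:
            h = patch_h  # causal intervention: swap in the donor state
        _, h = model.step(torch.tensor([tok]), h)
        states.append(h.detach())
    return states

model = TinyCounter()
seq_3 = [VOCAB["<bos>"]] + [VOCAB["item"]] * 3 + [VOCAB["respond"]]
seq_7 = [VOCAB["<bos>"]] + [VOCAB["item"]] * 7 + [VOCAB["respond"]]

# Donor state taken after the 7th item, patched in just before the response
# phase of the 3-item run; the patched run's outputs are then compared against
# the counterfactual 7-item behavior.
donor_states = run(model, seq_7)
patched_states = run(model, seq_3, patch_at=4, patch_h=donor_states[7])
```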
Abstract: When can we say that two neural systems are the same? The answer to this question is goal-dependent, and it is often addressed through correlative methods such as Representational Similarity Analysis (RSA) and Centered Kernel Alignment (CKA). What do we miss when we forgo causal explorations, and how can we target specific types of similarity? In this work, we introduce Model Alignment Search (MAS), a method for causally exploring distributed representational similarity. The method learns invertible linear transformations that align a subspace of two distributed networks' representations in which causal information can be freely interchanged. We first show that the method can be used to transfer specific causal variables, such as the number of items in a counting task, between networks with different training seeds. We then explore open questions in number cognition by comparing different types of numeric representations in models trained on structurally different numeric tasks. Next, we contrast MAS with preexisting causal similarity methods, showing MAS to be more resistant to unwanted exchanges. Lastly, we introduce a counterfactual latent auxiliary loss function that helps shape causally relevant alignments even in cases where we do not have causal access to one of the two models during training.
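As a rough illustration of the alignment operation described above, the following sketch learns one invertible linear map per model and exchanges a designated causal subspace between the two models' hidden states. The class names, subspace size, and state shapes are assumptions for illustration, and the training signal (matching counterfactual behavior after the swap) is only outlined in comments; this is not the released MAS implementation.

```python
# Minimal sketch (assumed details) of the core MAS operation: map each model's
# hidden state into an aligned space with an invertible linear transform, swap
# the first n_causal dimensions between models, and map back.
import torch
import torch.nn as nn

class RotationAligner(nn.Module):
    """Invertible linear transform for one model; the first `n_causal`
    dimensions of the aligned space hold the exchangeable causal content."""
    def __init__(self, d_model, n_causal):
        super().__init__()
        self.W = nn.Parameter(torch.empty(d_model, d_model))
        nn.init.orthogonal_(self.W)
        self.n_causal = n_causal

    def to_aligned(self, h):
        return h @ self.W

    def from_aligned(self, z):
        # Invert via a linear solve; if W is kept orthogonal, W.T also works.
        return torch.linalg.solve(self.W.T, z.T).T

def swap_causal(h_a, h_b, align_a, align_b):
    """Replace model A's causal subspace with model B's, in A's coordinates."""
    z_a, z_b = align_a.to_aligned(h_a), align_b.to_aligned(h_b)
    k = align_a.n_causal
    z_a_new = torch.cat([z_b[:, :k], z_a[:, k:]], dim=-1)
    return align_a.from_aligned(z_a_new)

# Usage: h_a and h_b would be hidden states from two trained networks at the
# timestep of interest. The aligners are optimized so that running model A
# forward from the swapped state reproduces the counterfactual behavior
# implied by model B's causal variable (e.g., its item count).
d, k = 64, 8
align_a, align_b = RotationAligner(d, k), RotationAligner(d, k)
h_a, h_b = torch.randn(4, d), torch.randn(4, d)
h_a_swapped = swap_causal(h_a, h_b, align_a, align_b)
```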