Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Aritra Das

Grokking Modular Polynomials

Jun 05, 2024

Darshil Doshi, Tianyu He, Aritra Das, Andrey Gromov

Figure 1 for Grokking Modular Polynomials

Figure 2 for Grokking Modular Polynomials

Figure 3 for Grokking Modular Polynomials

Abstract:Neural networks readily learn a subset of the modular arithmetic tasks, while failing to generalize on the rest. This limitation remains unmoved by the choice of architecture and training strategies. On the other hand, an analytical solution for the weights of Multi-layer Perceptron (MLP) networks that generalize on the modular addition task is known in the literature. In this work, we (i) extend the class of analytical solutions to include modular multiplication as well as modular addition with many terms. Additionally, we show that real networks trained on these datasets learn similar solutions upon generalization (grokking). (ii) We combine these "expert" solutions to construct networks that generalize on arbitrary modular polynomials. (iii) We hypothesize a classification of modular polynomials into learnable and non-learnable via neural networks training; and provide experimental evidence supporting our claims.

* 7+4 pages, 3 figures, 2 tables

Via

Access Paper or Ask Questions

Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Jun 04, 2024

Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov

Figure 1 for Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Figure 2 for Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Figure 3 for Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Figure 4 for Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Abstract:Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions $z = a \, x + b \, y \;\mathrm{mod}\; p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is \emph{transient}, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing the highly structured representations in both phases; and discuss the learnt algorithm.

* 21 pages, 19 figures

Via

Access Paper or Ask Questions

To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets

Oct 19, 2023

Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov

Abstract:Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know if the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones. Namely, we consider two-layer neural networks trained on modular arithmetic tasks where ($\xi \cdot 100\%$) of labels are corrupted (\emph{i.e.} some results of the modular operations in the training set are incorrect). We show that (i) it is possible for the network to memorize the corrupted labels \emph{and} achieve $100\%$ generalization at the same time; (ii) the memorizing neurons can be identified and pruned, lowering the accuracy on corrupted data and improving the accuracy on uncorrupted data; (iii) regularization methods such as weight decay, dropout and BatchNorm force the network to ignore the corrupted data during optimization, and achieve $100\%$ accuracy on the uncorrupted dataset; and (iv) the effect of these regularization methods is (``mechanistically'') interpretable: weight decay and dropout force all the neurons to learn generalizing representations, while BatchNorm de-amplifies the output of memorizing neurons and amplifies the output of the generalizing ones. Finally, we show that in the presence of regularization, the training dynamics involves two consecutive stages: first, the network undergoes the \emph{grokking} dynamics reaching high train \emph{and} test accuracy; second, it unlearns the memorizing representations, where train accuracy suddenly jumps from $100\%$ to $100 (1-\xi)\%$.

* 24 pages, 22 figures, 2 tables

Via

Access Paper or Ask Questions

Combining Multi-level Contexts of Superpixel using Convolutional Neural Networks to perform Natural Scene Labeling

Mar 14, 2018

Aritra Das, Swarnendu Ghosh, Ritesh Sarkhel, Sandipan Choudhuri, Nibaran Das, Mita Nasipuri

Figure 1 for Combining Multi-level Contexts of Superpixel using Convolutional Neural Networks to perform Natural Scene Labeling

Figure 2 for Combining Multi-level Contexts of Superpixel using Convolutional Neural Networks to perform Natural Scene Labeling

Figure 3 for Combining Multi-level Contexts of Superpixel using Convolutional Neural Networks to perform Natural Scene Labeling

Figure 4 for Combining Multi-level Contexts of Superpixel using Convolutional Neural Networks to perform Natural Scene Labeling

Abstract:Modern deep learning algorithms have triggered various image segmentation approaches. However most of them deal with pixel based segmentation. However, superpixels provide a certain degree of contextual information while reducing computation cost. In our approach, we have performed superpixel level semantic segmentation considering 3 various levels as neighbours for semantic contexts. Furthermore, we have enlisted a number of ensemble approaches like max-voting and weighted-average. We have also used the Dempster-Shafer theory of uncertainty to analyze confusion among various classes. Our method has proved to be superior to a number of different modern approaches on the same dataset.

* Accepted for publication in the Proceedings of Second International Conference on Computing and Communication 2018, Sikkim Manipal Institute of Technology, Sikkim, India

Via

Access Paper or Ask Questions