
Ilya Loshchilov

LIS

nGPT: Normalized Transformer with Representation Learning on the Hypersphere

Oct 01, 2024

Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg

Abstract: We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
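The idea of a hidden state that stays on the unit hypersphere while each block contributes a displacement can be illustrated with a small sketch. The update rule below (move a fraction `alpha` toward the normalized block output, then renormalize) and the names `normalize`, `hypersphere_step`, and `alpha` are assumptions made for illustration; the abstract does not specify the exact update.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean norm (assumes v is not the zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def hypersphere_step(h, block_out, alpha):
    """Hypothetical layer update: displace h toward the block output, stay on the sphere."""
    h_b = normalize(block_out)                       # block output projected to the sphere
    moved = [hi + alpha * (bi - hi) for hi, bi in zip(h, h_b)]
    return normalize(moved)                          # retract back to unit norm

h = normalize([1.0, 0.0, 0.0])
target = [0.0, 2.0, 0.0]                             # stand-in for an MLP/attention output
for _ in range(3):
    h = hypersphere_step(h, target, alpha=0.5)
```

After each step `h` still has unit norm, and repeated steps rotate it toward the direction suggested by the block.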

Weight Norm Control

Nov 21, 2023

Ilya Loshchilov

Abstract: We note that decoupled weight decay regularization is a particular case of weight norm control where the target norm of weights is set to 0. Any optimization method (e.g., Adam) which uses decoupled weight decay regularization (respectively, AdamW) can be viewed as a particular case of a more general algorithm with weight norm control (respectively, AdamWN). We argue that setting the target norm of weights to 0 can be suboptimal and other target norm values can be considered. For instance, any training run where AdamW achieves a particular norm of weights can be challenged by AdamWN scheduled to achieve a comparable norm of weights. We discuss various implications of introducing weight norm control instead of weight decay.
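One way to see why decoupled weight decay corresponds to a target norm of 0 is the sketch below. The specific rule (rescale the weights by `1 - lr * k * (1 - target_norm / ||w||)`) and the names `norm_control_step`, `k`, and `target_norm` are assumptions chosen so that a target of 0 reproduces decoupled decay; this is not claimed to be the paper's AdamWN update.

```python
import math

def l2_norm(w):
    return math.sqrt(sum(x * x for x in w))

def norm_control_step(w, opt_step, lr, k, target_norm):
    """Hypothetical update: apply the optimizer step and pull ||w|| toward target_norm."""
    n = l2_norm(w)
    # With target_norm == 0 the scale is (1 - lr * k), i.e. decoupled weight decay.
    scale = 1.0 - lr * k * (1.0 - target_norm / n) if n > 0 else 1.0
    return [scale * wi - lr * si for wi, si in zip(w, opt_step)]

w = [3.0, 4.0]                      # norm 5
no_step = [0.0, 0.0]                # isolate the norm-control term
decayed = norm_control_step(w, no_step, lr=0.1, k=1.0, target_norm=0.0)
held = norm_control_step(w, no_step, lr=0.1, k=1.0, target_norm=5.0)
grown = norm_control_step(w, no_step, lr=0.1, k=1.0, target_norm=10.0)
```

In this sketch a target equal to the current norm leaves the weights unchanged, and a larger target increases the norm instead of shrinking it.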

Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari

Feb 24, 2018

Patryk Chrabaszcz, Ilya Loshchilov, Frank Hutter

Abstract: Evolution Strategies (ES) have recently been demonstrated to be a viable alternative to reinforcement learning (RL) algorithms on a set of challenging deep RL problems, including Atari games and MuJoCo humanoid locomotion benchmarks. While the ES algorithms in that work belonged to the specialized class of natural evolution strategies (which resemble approximate gradient RL algorithms, such as REINFORCE), we demonstrate that even a very basic canonical ES algorithm can achieve the same or even better performance. This success of a basic ES algorithm suggests that the state-of-the-art can be advanced further by integrating the many advances made in the field of ES in the last decades. We also demonstrate qualitatively that ES algorithms have very different performance characteristics than traditional RL algorithms: on some games, they learn to exploit the environment and perform much better while on others they can get stuck in suboptimal local minima. Combining their strengths with those of traditional RL algorithms is therefore likely to lead to new advances in the state of the art.
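For readers unfamiliar with what a "very basic canonical ES" might look like, here is a minimal (mu, lambda)-style sketch on a toy minimization problem. The fixed step size `sigma`, the log-rank recombination weights, the toy `sphere` objective, and all parameter values are assumptions for illustration; the abstract does not give the algorithm's details, and the paper's setting is reward maximization on Atari rather than this toy function.

```python
import math
import random

def sphere(x):
    """Toy objective to minimize (hypothetical stand-in for a negative game score)."""
    return sum(xi * xi for xi in x)

def canonical_es(f, x0, sigma=0.3, lam=20, mu=5, iters=50, seed=0):
    rng = random.Random(seed)
    # Log-rank weights: better-ranked offspring contribute more to the new mean.
    raw = [math.log(mu + 0.5) - math.log(i + 1) for i in range(mu)]
    weights = [r / sum(raw) for r in raw]
    mean = list(x0)
    for _ in range(iters):
        # Sample lam perturbations and rank them by the fitness of mean + sigma * eps.
        noises = [[rng.gauss(0.0, 1.0) for _ in mean] for _ in range(lam)]
        ranked = sorted(noises, key=lambda eps: f([m + sigma * e for m, e in zip(mean, eps)]))
        # Move the mean along the weighted average of the mu best perturbations.
        for j in range(len(mean)):
            mean[j] += sigma * sum(w * ranked[k][j] for k, w in enumerate(weights))
    return mean

x0 = [3.0] * 5
best = canonical_es(sphere, x0)
```

No gradient of `f` is used anywhere; only the ranking of sampled candidates drives the search.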

Fixing Weight Decay Regularization in Adam

Feb 14, 2018

Ilya Loshchilov, Frank Hutter

Abstract: L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common deep learning frameworks of these algorithms implement L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). We also propose a version of Adam with warm restarts (AdamWR) that has strong anytime performance while achieving state-of-the-art results on CIFAR-10 and ImageNet32x32. Our source code is available at https://github.com/loshchil/AdamW-and-SGDW
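The difference between L$_2$ regularization and decoupled weight decay under Adam can be seen in a single-parameter sketch. The function `adam_step` and its arguments `l2` and `decoupled_wd` are illustrative names, and scaling the decoupled term by `lr` is an assumption that mirrors a common framework convention rather than a detail taken from the abstract.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam step on a scalar weight with either L2 or decoupled weight decay."""
    g = g + l2 * w                        # L2: penalty gradient enters the adaptive statistics
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    # Decoupled decay acts on the weight directly and is not rescaled by sqrt(v_hat).
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * decoupled_wd * w
    return w_new, m, v

# Zero loss gradient, so only the regularizer moves the weight.
w_l2_small, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.01)
w_l2_large, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.1)
w_wd_small, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=0.01)
w_wd_large, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=0.1)
```

In this toy case the L$_2$ variant takes almost the same step for both coefficients, because Adam normalizes the penalty gradient away, whereas the decoupled variant shrinks the weight in proportion to the decay factor.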

A Downsampled Variant of ImageNet as an Alternative to the CIFAR datasets

Aug 23, 2017

Patryk Chrabaszcz, Ilya Loshchilov, Frank Hutter

Abstract: The original ImageNet dataset is a popular large-scale benchmark for training Deep Neural Networks. Since the cost of performing experiments (e.g., algorithm design, architecture search, and hyperparameter tuning) on the original dataset might be prohibitive, we propose to consider a downsampled version of ImageNet. In contrast to the CIFAR datasets and earlier downsampled versions of ImageNet, our proposed ImageNet32$\times$32 (and its variants ImageNet64$\times$64 and ImageNet16$\times$16) contains exactly the same number of classes and images as ImageNet, with the only difference that the images are downsampled to 32$\times$32 pixels per image (64$\times$64 and 16$\times$16 pixels for the variants, respectively). Experiments on these downsampled variants are dramatically faster than on the original ImageNet and the characteristics of the downsampled datasets with respect to optimal hyperparameters appear to remain similar. The proposed datasets and scripts to reproduce our results are available at http://image-net.org/download-images and https://github.com/PatrykChrabaszcz/Imagenet32_Scripts

Limited-Memory Matrix Adaptation for Large Scale Black-box Optimization

May 18, 2017

Ilya Loshchilov, Tobias Glasmachers, Hans-Georg Beyer

Abstract: The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is a popular method to deal with nonconvex and/or stochastic optimization problems when the gradient information is not available. Being based on the CMA-ES, the recently proposed Matrix Adaptation Evolution Strategy (MA-ES) provides a rather surprising result that the covariance matrix and all associated operations (e.g., potentially unstable eigendecomposition) can be replaced in the CMA-ES by an updated transformation matrix without any loss of performance. In order to further simplify MA-ES and reduce its $\mathcal{O}\big(n^2\big)$ time and storage complexity to $\mathcal{O}\big(n\log(n)\big)$, we present the Limited-Memory Matrix Adaptation Evolution Strategy (LM-MA-ES) for efficient zeroth order large-scale optimization. The algorithm demonstrates state-of-the-art performance on a set of established large-scale benchmarks. We explore the algorithm on the problem of generating adversarial inputs for a (non-smooth) random forest classifier, demonstrating a surprising vulnerability of the classifier.

SGDR: Stochastic Gradient Descent with Warm Restarts

May 03, 2017

Ilya Loshchilov, Frank Hutter

Abstract: Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR

* ICLR 2017 conference paper
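A warm-restart learning-rate schedule can be sketched as follows. The abstract above does not spell out the schedule; this sketch assumes cosine annealing within each run, with the run length multiplied by `t_mult` after every restart. The function name `sgdr_lr` and the values of `eta_max`, `t0`, and `t_mult` are illustrative choices.

```python
import math

def sgdr_lr(epoch, eta_min=0.0, eta_max=0.05, t0=10, t_mult=2):
    """Learning rate at a (possibly fractional) epoch under cosine annealing with restarts."""
    t_i, t_cur = t0, epoch
    # Skip over completed runs; each run is t_mult times longer than the previous one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # Cosine decay from eta_max down toward eta_min within the current run.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

schedule = [sgdr_lr(e) for e in range(31)]
```

With these values the rate decays over epochs 0 to 9, jumps back to `eta_max` at epoch 10, and then decays over a run twice as long.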

Anytime Bi-Objective Optimization with a Hybrid Multi-Objective CMA-ES (HMO-CMA-ES)

May 09, 2016

Ilya Loshchilov, Tobias Glasmachers

Abstract: We propose a multi-objective optimization algorithm aimed at achieving good anytime performance over a wide range of problems. Performance is assessed in terms of the hypervolume metric. The algorithm, called HMO-CMA-ES, represents a hybrid of several old and new variants of CMA-ES, complemented by BOBYQA as a warm start. We benchmark HMO-CMA-ES on the recently introduced bi-objective problem suite of the COCO framework (COmparing Continuous Optimizers), consisting of 55 scalable continuous optimization problems, which is used by the Black-Box Optimization Benchmarking (BBOB) Workshop 2016.

* BBOB workshop of GECCO'2016

CMA-ES for Hyperparameter Optimization of Deep Neural Networks

Apr 25, 2016

Ilya Loshchilov, Frank Hutter

Abstract: Hyperparameters of deep neural networks are often optimized by grid search, random search or Bayesian optimization. As an alternative, we propose to use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which is known for its state-of-the-art performance in derivative-free optimization. CMA-ES has some useful invariance properties and is friendly to parallel evaluations of solutions. We provide a toy example comparing CMA-ES and state-of-the-art Bayesian optimization algorithms for tuning the hyperparameters of a convolutional neural network for the MNIST dataset on 30 GPUs in parallel.

Online Batch Selection for Faster Training of Neural Networks

Apr 25, 2016

Ilya Loshchilov, Frank Hutter

Abstract: Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.

* Workshop paper at ICLR 2016
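The rank-based selection described in the abstract above can be sketched in a few lines. The parameterization via `pressure` (roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints, assumed to be at least 1) and the function names are assumptions for illustration, not details stated in the abstract.

```python
import math
import random

def selection_probs(losses, pressure):
    """Selection probability per datapoint, decaying exponentially with loss rank."""
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(math.log(pressure) / n)
    weights = [1.0 / decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs

def sample_batch(losses, batch_size, pressure, rng):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

losses = [0.1, 2.5, 0.7, 1.3]
probs = selection_probs(losses, pressure=100.0)
batch = sample_batch(losses, batch_size=8, pressure=100.0, rng=random.Random(0))
```

Setting `pressure` to 1 recovers uniform sampling, so the selection pressure can be varied over the course of training.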
