Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiangning Chen

Why Does Sharpness-Aware Minimization Generalize Better Than SGD?

Oct 11, 2023

Zixiang Chen, Junkai Zhang, Yiwen Kou, Xiangning Chen, Cho-Jui Hsieh, Quanquan Gu

Figure 1 for Why Does Sharpness-Aware Minimization Generalize Better Than SGD?

Figure 2 for Why Does Sharpness-Aware Minimization Generalize Better Than SGD?

Figure 3 for Why Does Sharpness-Aware Minimization Generalize Better Than SGD?

Figure 4 for Why Does Sharpness-Aware Minimization Generalize Better Than SGD?

Abstract:The challenge of overfitting, in which the model memorizes the training data and fails to generalize to test data, has become increasingly significant in the training of large neural networks. To tackle this challenge, Sharpness-Aware Minimization (SAM) has emerged as a promising training method, which can improve the generalization of neural networks even in the presence of label noise. However, a deep understanding of how SAM works, especially in the setting of nonlinear neural networks and classification tasks, remains largely missing. This paper fills this gap by demonstrating why SAM generalizes better than Stochastic Gradient Descent (SGD) for a certain data model and two-layer convolutional ReLU networks. The loss landscape of our studied problem is nonsmooth, thus current explanations for the success of SAM based on the Hessian information are insufficient. Our result explains the benefits of SAM, particularly its ability to prevent noise learning in the early stages, thereby facilitating more effective learning of features. Experiments on both synthetic and real data corroborate our theory.

* 52 pages, 4 figures, 2 tables. In NeurIPS 2023

Via

Access Paper or Ask Questions

Red Teaming Language Model Detectors with Language Models

May 31, 2023

Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, Cho-Jui Hsieh

Abstract:The prevalence and high capacity of large language models (LLMs) present significant safety and ethical risks when malicious users exploit them for automated content generation. To prevent the potentially deceptive usage of LLMs, recent works have proposed several algorithms to detect machine-generated text. In this paper, we systematically test the reliability of the existing detectors, by designing two types of attack strategies to fool the detectors: 1) replacing words with their synonyms based on the context; 2) altering the writing style of generated text. These strategies are implemented by instructing LLMs to generate synonymous word substitutions or writing directives that modify the style without human involvement, and the LLMs leveraged in the attack can also be protected by detectors. Our research reveals that our attacks effectively compromise the performance of all tested detectors, thereby underscoring the urgent need for the development of more robust machine-generated text detection systems.

* Work in progress. Zhouxing Shi, Yihan Wang and Fan Yin are ordered alphabetically

Via

Access Paper or Ask Questions

Symbol tuning improves in-context learning in language models

May 15, 2023

Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma(+1 more)

Figure 1 for Symbol tuning improves in-context learning in language models

Figure 2 for Symbol tuning improves in-context learning in language models

Figure 3 for Symbol tuning improves in-context learning in language models

Figure 4 for Symbol tuning improves in-context learning in language models

Abstract:We present symbol tuning - finetuning language models on in-context input-label pairs where natural language labels (e.g., "positive/negative sentiment") are replaced with arbitrary symbols (e.g., "foo/bar"). Symbol tuning leverages the intuition that when a model cannot use instructions or natural language labels to figure out a task, it must instead do so by learning the input-label mappings. We experiment with symbol tuning across Flan-PaLM models up to 540B parameters and observe benefits across various settings. First, symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels. Second, symbol-tuned models are much stronger at algorithmic reasoning tasks, with up to 18.2% better performance on the List Functions benchmark and up to 15.3% better performance on the Simple Turing Concepts benchmark. Finally, symbol-tuned models show large improvements in following flipped-labels presented in-context, meaning that they are more capable of using in-context information to override prior semantic knowledge.

Via

Access Paper or Ask Questions

Symbolic Discovery of Optimization Algorithms

Feb 17, 2023

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh(+2 more)

Figure 1 for Symbolic Discovery of Optimization Algorithms

Figure 2 for Symbolic Discovery of Optimization Algorithms

Figure 3 for Symbolic Discovery of Optimization Algorithms

Figure 4 for Symbolic Discovery of Optimization Algorithms

Abstract:We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, $\textbf{Lion}$ ($\textit{Evo$\textbf{L}$ved S$\textbf{i}$gn M$\textbf{o}$me$\textbf{n}$tum}$). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3% $\textit{zero-shot}$ and 91.1% $\textit{fine-tuning}$ accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion exhibits a similar or better performance compared to Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. The implementation of Lion is publicly available.

* 29 pages, we clarified the recommended learning rate and added some references

Via

Access Paper or Ask Questions

Towards Efficient and Scalable Sharpness-Aware Minimization

Mar 05, 2022

Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, Yang You

Figure 1 for Towards Efficient and Scalable Sharpness-Aware Minimization

Figure 2 for Towards Efficient and Scalable Sharpness-Aware Minimization

Figure 3 for Towards Efficient and Scalable Sharpness-Aware Minimization

Figure 4 for Towards Efficient and Scalable Sharpness-Aware Minimization

Abstract:Recently, Sharpness-Aware Minimization (SAM), which connects the geometry of the loss landscape and generalization, has demonstrated significant performance boosts on training large-scale models such as vision transformers. However, the update rule of SAM requires two sequential (non-parallelizable) gradient computations at each step, which can double the computational overhead. In this paper, we propose a novel algorithm LookSAM - that only periodically calculates the inner gradient ascent, to significantly reduce the additional training cost of SAM. The empirical results illustrate that LookSAM achieves similar accuracy gains to SAM while being tremendously faster - it enjoys comparable computational complexity with first-order optimizers such as SGD or Adam. To further evaluate the performance and scalability of LookSAM, we incorporate a layer-wise modification and perform experiments in the large-batch training scenario, which is more prone to converge to sharp local minima. We are the first to successfully scale up the batch size when training Vision Transformers (ViTs). With a 64k batch size, we are able to train ViTs from scratch in minutes while maintaining competitive performance.

* Accepted by CVPR 2022

Via

Access Paper or Ask Questions

Can Vision Transformers Perform Convolution?

Nov 03, 2021

Shanda Li, Xiangning Chen, Di He, Cho-Jui Hsieh

Figure 1 for Can Vision Transformers Perform Convolution?

Figure 2 for Can Vision Transformers Perform Convolution?

Figure 3 for Can Vision Transformers Perform Convolution?

Abstract:Several recent studies have demonstrated that attention-based networks, such as Vision Transformer (ViT), can outperform Convolutional Neural Networks (CNNs) on several computer vision tasks without using convolutional layers. This naturally leads to the following questions: Can a self-attention layer of ViT express any convolution operation? In this work, we prove that a single ViT layer with image patches as the input can perform any convolution operation constructively, where the multi-head attention mechanism and the relative positional encoding play essential roles. We further provide a lower bound on the number of heads for Vision Transformers to express CNNs. Corresponding with our analysis, experimental results show that the construction in our proof can help inject convolutional bias into Transformers and significantly improve the performance of ViT in low data regimes.

Via

Access Paper or Ask Questions

RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving

Aug 18, 2021

Ruochen Wang, Xiangning Chen, Minhao Cheng, Xiaocheng Tang, Cho-Jui Hsieh

Figure 1 for RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving

Figure 2 for RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving

Figure 3 for RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving

Figure 4 for RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving

Abstract:Predictor-based algorithms have achieved remarkable performance in the Neural Architecture Search (NAS) tasks. However, these methods suffer from high computation costs, as training the performance predictor usually requires training and evaluating hundreds of architectures from scratch. Previous works along this line mainly focus on reducing the number of architectures required to fit the predictor. In this work, we tackle this challenge from a different perspective - improve search efficiency by cutting down the computation budget of architecture training. We propose NOn-uniform Successive Halving (NOSH), a hierarchical scheduling algorithm that terminates the training of underperforming architectures early to avoid wasting budget. To effectively leverage the non-uniform supervision signals produced by NOSH, we formulate predictor-based architecture search as learning to rank with pairwise comparisons. The resulting method - RANK-NOSH, reduces the search budget by ~5x while achieving competitive or even better performance than previous state-of-the-art predictor-based methods on various spaces and datasets.

* To Appear in ICCV2021. The code will be released shortly at https://github.com/ruocwang

Via

Access Paper or Ask Questions

Rethinking Architecture Selection in Differentiable NAS

Aug 10, 2021

Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, Cho-Jui Hsieh

Figure 1 for Rethinking Architecture Selection in Differentiable NAS

Figure 2 for Rethinking Architecture Selection in Differentiable NAS

Figure 3 for Rethinking Architecture Selection in Differentiable NAS

Figure 4 for Rethinking Architecture Selection in Differentiable NAS

Abstract:Differentiable Neural Architecture Search is one of the most popular Neural Architecture Search (NAS) methods for its search efficiency and simplicity, accomplished by jointly optimizing the model weight and architecture parameters in a weight-sharing supernet via gradient-based algorithms. At the end of the search phase, the operations with the largest architecture parameters will be selected to form the final architecture, with the implicit assumption that the values of architecture parameters reflect the operation strength. While much has been discussed about the supernet's optimization, the architecture selection process has received little attention. We provide empirical and theoretical analysis to show that the magnitude of architecture parameters does not necessarily indicate how much the operation contributes to the supernet's performance. We propose an alternative perturbation-based architecture selection that directly measures each operation's influence on the supernet. We re-evaluate several differentiable NAS methods with the proposed architecture selection and find that it is able to extract significantly improved architectures from the underlying supernets consistently. Furthermore, we find that several failure modes of DARTS can be greatly alleviated with the proposed selection method, indicating that much of the poor generalization observed in DARTS can be attributed to the failure of magnitude-based architecture selection rather than entirely the optimization of its supernet.

* Outstanding Paper Award at ICLR 2021. The code is available at https://github.com/ruocwang/darts-pt

Via

Access Paper or Ask Questions

When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Jun 03, 2021

Xiangning Chen, Cho-Jui Hsieh, Boqing Gong

Figure 1 for When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Figure 2 for When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Figure 3 for When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Figure 4 for When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Abstract:Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3\% and +11.0\% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness attributes to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.

Via

Access Paper or Ask Questions

Concurrent Adversarial Learning for Large-Batch Training

Jun 01, 2021

Yong Liu, Xiangning Chen, Minhao Cheng, Cho-Jui Hsieh, Yang You

Figure 1 for Concurrent Adversarial Learning for Large-Batch Training

Figure 2 for Concurrent Adversarial Learning for Large-Batch Training

Figure 3 for Concurrent Adversarial Learning for Large-Batch Training

Figure 4 for Concurrent Adversarial Learning for Large-Batch Training

Abstract:Large-batch training has become a commonly used technique when training neural networks with a large number of GPU/TPU processors. As batch size increases, stochastic optimizers tend to converge to sharp local minima, leading to degraded test performance. Current methods usually use extensive data augmentation to increase the batch size, but we found the performance gain with data augmentation decreases as batch size increases, and data augmentation will become insufficient after certain point. In this paper, we propose to use adversarial learning to increase the batch size in large-batch training. Despite being a natural choice for smoothing the decision surface and biasing towards a flat region, adversarial learning has not been successfully applied in large-batch training since it requires at least two sequential gradient computations at each step, which will at least double the running time compared with vanilla training even with a large number of processors. To overcome this issue, we propose a novel Concurrent Adversarial Learning (ConAdv) method that decouple the sequential gradient computations in adversarial learning by utilizing staled parameters. Experimental results demonstrate that ConAdv can successfully increase the batch size on both ResNet-50 and EfficientNet training on ImageNet while maintaining high accuracy. In particular, we show ConAdv along can achieve 75.3\% top-1 accuracy on ImageNet ResNet-50 training with 96K batch size, and the accuracy can be further improved to 76.2\% when combining ConAdv with data augmentation. This is the first work successfully scales ResNet-50 training batch size to 96K.

Via

Access Paper or Ask Questions