Abstract:Parameter sharing has proven to be an effective approach to parameter efficiency. Previous work on Transformers has focused on sharing parameters across layers, which can improve the performance of models with a limited parameter budget by increasing model depth. In this paper, we study why this approach works from two perspectives. First, increasing model depth makes the model more complex, so we hypothesize that the gain is related to model complexity (measured in FLOPs). Second, since each shared parameter participates in the network computation several times during forward propagation, its gradient has a different range of values than in the original model, which affects convergence. Based on this, we hypothesize that training convergence may also be one of the reasons. Through further analysis, we show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity. Inspired by this, we tune the training hyperparameters related to model convergence in a targeted manner. Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
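A minimal PyTorch sketch (not from the paper) of the gradient-range effect this abstract describes: a weight reused in every layer receives the sum of the gradients from all of its uses, so its gradient magnitude differs from the unshared, per-layer case.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_layers = 16, 4
x = torch.randn(8, d)

# Shared case: one weight matrix reused in every layer.
shared = nn.Linear(d, d, bias=False)
h = x
for _ in range(n_layers):
    h = torch.tanh(shared(h))
h.sum().backward()
print("shared grad norm:", shared.weight.grad.norm().item())

# Unshared case: each layer has its own weight matrix.
layers = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(n_layers))
h = x
for layer in layers:
    h = torch.tanh(layer(h))
h.sum().backward()
print("per-layer grad norms:",
      [round(l.weight.grad.norm().item(), 4) for l in layers])
```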
Abstract:Deploying NMT models on mobile devices is essential for privacy, low latency, and offline scenarios. To achieve high capacity, NMT models are rather large, and running them on devices with limited storage, memory, computation, and power is challenging. Existing work either focuses only on a single metric such as FLOPs or relies on a general-purpose engine that is not good at auto-regressive decoding. In this paper, we present MobileNMT, a system that can translate in 15MB and 30ms on devices. We propose a series of principles for model compression when combined with quantization. Further, we implement an engine that is friendly to INT8 and decoding. With the co-design of model and engine, compared with the existing system, we achieve a 47.0x speedup and save 99.5% of memory with only an 11.6% loss in BLEU. The code is publicly available at https://github.com/zjersey/Lightseq-ARM.
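MobileNMT's actual compression scheme and INT8 engine are described in the paper and the linked repository; the snippet below is only a generic sketch of symmetric per-tensor INT8 weight quantization, to illustrate the roughly 4x storage saving and the rounding error that the model/engine co-design has to absorb.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"storage: {w.nbytes/1e6:.2f}MB -> {q.nbytes/1e6:.2f}MB, "
      f"mean abs error {err:.5f}")
```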
Abstract:PARAGEN is a PyTorch-based NLP toolkit for further development on parallel generation. PARAGEN provides thirteen types of customizable plugins, helping users experiment quickly with novel ideas across model architectures, optimization, and learning strategies. We implement various features, such as unlimited data loading and automatic model selection, to enhance its industrial usage. PARAGEN is now deployed to support various research and industry applications at ByteDance. PARAGEN is available at https://github.com/bytedance/ParaGen.
Abstract:Single-path differentiable neural architecture search has great strengths for its low computational cost and memory-friendly nature. However, we surprisingly discover that it suffers from severe searching instability, which has been largely ignored, posing a potential weakness for wider application. In this paper, we delve into its performance collapse issue and propose a new algorithm called RObustifying Memory-Efficient NAS (ROME). Specifically, 1) to keep the topology consistent between the search and evaluation stages, we introduce separate parameters to disentangle the topology from the operations of the architecture, so that connections and operations can be sampled independently without interference; 2) to reduce sampling unfairness and variance, we enforce fair sampling for weight updates and apply a gradient accumulation mechanism to the architecture parameters. Extensive experiments demonstrate that our proposed method has strong performance and robustness, achieving state-of-the-art results on most of a large number of standard benchmarks.
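The sketch below is a hypothetical PyTorch illustration (not the authors' code) of the two ideas in the abstract: separate logits for connections (topology) and operations so they can be sampled independently, and gradient accumulation over several sampled sub-networks before a single architecture-parameter update. The loss here is a placeholder standing in for the validation loss of the sampled sub-network.

```python
import torch
import torch.nn.functional as F

num_edges, num_ops = 14, 8

# Disentangled architecture parameters: one set decides whether an edge
# (connection) is kept, the other which candidate operation runs on it.
topo_logits = torch.zeros(num_edges, requires_grad=True)
op_logits = torch.zeros(num_edges, num_ops, requires_grad=True)

def sample_architecture():
    """Sample connections and operations independently (single-path style)."""
    edge_on = torch.bernoulli(torch.sigmoid(topo_logits))   # hard 0/1 per edge
    op_idx = torch.multinomial(F.softmax(op_logits, dim=-1), 1).squeeze(-1)
    return edge_on, op_idx

# Accumulate architecture gradients over several sampled sub-networks
# before one update, to reduce the variance of single-path sampling.
accum_steps = 4
arch_opt = torch.optim.Adam([topo_logits, op_logits], lr=3e-4)
arch_opt.zero_grad()
for _ in range(accum_steps):
    edge_on, op_idx = sample_architecture()
    # Placeholder objective: a real search would use the validation loss
    # of the sub-network defined by (edge_on, op_idx).
    edge_term = (torch.sigmoid(topo_logits) * edge_on).sum()
    op_term = F.softmax(op_logits, dim=-1)[torch.arange(num_edges), op_idx].sum()
    ((edge_term + op_term) / accum_steps).backward()
arch_opt.step()
```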