Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ye Lin

Orthogonal greedy algorithm for linear operator learning with shallow neural network

Jan 06, 2025

Ye Lin, Jiwei Jia, Young Ju Lee, Ran Zhang

Figure 1 for Orthogonal greedy algorithm for linear operator learning with shallow neural network

Figure 2 for Orthogonal greedy algorithm for linear operator learning with shallow neural network

Figure 3 for Orthogonal greedy algorithm for linear operator learning with shallow neural network

Figure 4 for Orthogonal greedy algorithm for linear operator learning with shallow neural network

Abstract:Greedy algorithms, particularly the orthogonal greedy algorithm (OGA), have proven effective in training shallow neural networks for fitting functions and solving partial differential equations (PDEs). In this paper, we extend the application of OGA to the tasks of linear operator learning, which is equivalent to learning the kernel function through integral transforms. Firstly, a novel greedy algorithm is developed for kernel estimation rate in a new semi-inner product, which can be utilized to approximate the Green's function of linear PDEs from data. Secondly, we introduce the OGA for point-wise kernel estimation to further improve the approximation rate, achieving orders of accuracy improvement across various tasks and baseline models. In addition, we provide a theoretical analysis on the kernel estimation problem and the optimal approximation rates for both algorithms, establishing their efficacy and potential for future applications in PDEs and operator learning tasks.

Via

Access Paper or Ask Questions

Green Multigrid Network

Jul 04, 2024

Ye Lin, Young Ju Lee, Jiwei Jia

Abstract:GreenLearning networks (GL) directly learn Green's function in physical space, making them an interpretable model for capturing unknown solution operators of partial differential equations (PDEs). For many PDEs, the corresponding Green's function exhibits asymptotic smoothness. In this paper, we propose a framework named Green Multigrid networks (GreenMGNet), an operator learning algorithm designed for a class of asymptotically smooth Green's functions. Compared with the pioneering GL, the new framework presents itself with better accuracy and efficiency, thereby achieving a significant improvement. GreenMGNet is composed of two technical novelties. First, Green's function is modeled as a piecewise function to take into account its singular behavior in some parts of the hyperplane. Such piecewise function is then approximated by a neural network with augmented output(AugNN) so that it can capture singularity accurately. Second, the asymptotic smoothness property of Green's function is used to leverage the Multi-Level Multi-Integration (MLMI) algorithm for both the training and inference stages. Several test cases of operator learning are presented to demonstrate the accuracy and effectiveness of the proposed method. On average, GreenMGNet achieves $3.8\%$ to $39.15\%$ accuracy improvement. To match the accuracy level of GL, GreenMGNet requires only about $10\%$ of the full grid data, resulting in a $55.9\%$ and $92.5\%$ reduction in training time and GPU memory cost for one-dimensional test problems, and a $37.7\%$ and $62.5\%$ reduction for two-dimensional test problems.

Via

Access Paper or Ask Questions

Understanding Parameter Sharing in Transformers

Jun 15, 2023

Ye Lin, Mingxuan Wang, Zhexi Zhang, Xiaohui Wang, Tong Xiao, Jingbo Zhu

Figure 1 for Understanding Parameter Sharing in Transformers

Figure 2 for Understanding Parameter Sharing in Transformers

Figure 3 for Understanding Parameter Sharing in Transformers

Figure 4 for Understanding Parameter Sharing in Transformers

Abstract:Parameter sharing has proven to be a parameter-efficient approach. Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth. In this paper, we study why this approach works from two perspectives. First, increasing model depth makes the model more complex, and we hypothesize that the reason is related to model complexity (referring to FLOPs). Secondly, since each shared parameter will participate in the network computation several times in forward propagation, its corresponding gradient will have a different range of values from the original model, which will affect the model convergence. Based on this, we hypothesize that training convergence may also be one of the reasons. Through further analysis, we show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity. Inspired by this, we tune the training hyperparameters related to model convergence in a targeted manner. Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.

Via

Access Paper or Ask Questions

MobileNMT: Enabling Translation in 15MB and 30ms

Jun 07, 2023

Ye Lin, Xiaohui Wang, Zhexi Zhang, Mingxuan Wang, Tong Xiao, Jingbo Zhu

Figure 1 for MobileNMT: Enabling Translation in 15MB and 30ms

Figure 2 for MobileNMT: Enabling Translation in 15MB and 30ms

Figure 3 for MobileNMT: Enabling Translation in 15MB and 30ms

Figure 4 for MobileNMT: Enabling Translation in 15MB and 30ms

Abstract:Deploying NMT models on mobile devices is essential for privacy, low latency, and offline scenarios. For high model capacity, NMT models are rather large. Running these models on devices is challenging with limited storage, memory, computation, and power consumption. Existing work either only focuses on a single metric such as FLOPs or general engine which is not good at auto-regressive decoding. In this paper, we present MobileNMT, a system that can translate in 15MB and 30ms on devices. We propose a series of principles for model compression when combined with quantization. Further, we implement an engine that is friendly to INT8 and decoding. With the co-design of model and engine, compared with the existing system, we speed up 47.0x and save 99.5% of memory with only 11.6% loss of BLEU. The code is publicly available at https://github.com/zjersey/Lightseq-ARM.

* accepted by ACL2023 Industry Track

Via

Access Paper or Ask Questions

Multi-Path Transformer is Better: A Case Study on Neural Machine Translation

May 10, 2023

Ye Lin, Shuhan Zhou, Yanyang Li, Anxiang Ma, Tong Xiao, Jingbo Zhu

Figure 1 for Multi-Path Transformer is Better: A Case Study on Neural Machine Translation

Figure 2 for Multi-Path Transformer is Better: A Case Study on Neural Machine Translation

Figure 3 for Multi-Path Transformer is Better: A Case Study on Neural Machine Translation

Figure 4 for Multi-Path Transformer is Better: A Case Study on Neural Machine Translation

Abstract:For years the model performance in machine learning obeyed a power-law relationship with the model size. For the consideration of parameter efficiency, recent studies focus on increasing model depth rather than width to achieve better performance. In this paper, we study how model width affects the Transformer model through a parameter-efficient multi-path structure. To better fuse features extracted from different paths, we add three additional operations to each sublayer: a normalization at the end of each path, a cheap operation to produce more features, and a learnable weighted mechanism to fuse all features flexibly. Extensive experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model. It reveals that we should pay more attention to the multi-path structure, and there should be a balance between the model depth and width to train a better large-scale Transformer.

* accepted by EMNLP2022 Findings

Via

Access Paper or Ask Questions

The NiuTrans System for WNGT 2020 Efficiency Task

Sep 16, 2021

Chi Hu, Bei Li, Ye Lin, Yinqiao Li, Yanyang Li, Chenglong Wang, Tong Xiao, Jingbo Zhu

Figure 1 for The NiuTrans System for WNGT 2020 Efficiency Task

Figure 2 for The NiuTrans System for WNGT 2020 Efficiency Task

Figure 3 for The NiuTrans System for WNGT 2020 Efficiency Task

Figure 4 for The NiuTrans System for WNGT 2020 Efficiency Task

Abstract:This paper describes the submissions of the NiuTrans Team to the WNGT 2020 Efficiency Shared Task. We focus on the efficient implementation of deep Transformer models \cite{wang-etal-2019-learning, li-etal-2019-niutrans} using NiuTensor (https://github.com/NiuTrans/NiuTensor), a flexible toolkit for NLP tasks. We explored the combination of deep encoder and shallow decoder in Transformer models via model compression and knowledge distillation. The neural machine translation decoding also benefits from FP16 inference, attention caching, dynamic batching, and batch pruning. Our systems achieve promising results in both translation quality and efficiency, e.g., our fastest system can translate more than 40,000 tokens per second with an RTX 2080 Ti while maintaining 42.9 BLEU on \textit{newstest2018}. The code, models, and docker images are available at NiuTrans.NMT (https://github.com/NiuTrans/NiuTrans.NMT).

* NiuTrans at the WNGT 2020 Translation Efficiency Task

Via

Access Paper or Ask Questions

The NiuTrans System for the WMT21 Efficiency Task

Sep 16, 2021

Chenglong Wang, Chi Hu, Yongyu Mu, Zhongxiang Yan, Siming Wu, Minyi Hu, Hang Cao, Bei Li, Ye Lin, Tong Xiao(+1 more)

Figure 1 for The NiuTrans System for the WMT21 Efficiency Task

Figure 2 for The NiuTrans System for the WMT21 Efficiency Task

Figure 3 for The NiuTrans System for the WMT21 Efficiency Task

Figure 4 for The NiuTrans System for the WMT21 Efficiency Task

Abstract:This paper describes the NiuTrans system for the WMT21 translation efficiency task (http://statmt.org/wmt21/efficiency-task.html). Following last year's work, we explore various techniques to improve efficiency while maintaining translation quality. We investigate the combinations of lightweight Transformer architectures and knowledge distillation strategies. Also, we improve the translation efficiency with graph optimization, low precision, dynamic batching, and parallel pre/post-processing. Our system can translate 247,000 words per second on an NVIDIA A100, being 3$\times$ faster than last year's system. Our system is the fastest and has the lowest memory consumption on the GPU-throughput track. The code, model, and pipeline will be available at NiuTrans.NMT (https://github.com/NiuTrans/NiuTrans.NMT).

* NiuTrans at the WMT21 Translation Efficiency Task

Via

Access Paper or Ask Questions

Bag of Tricks for Optimizing Transformer Efficiency

Sep 09, 2021

Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu

Figure 1 for Bag of Tricks for Optimizing Transformer Efficiency

Figure 2 for Bag of Tricks for Optimizing Transformer Efficiency

Figure 3 for Bag of Tricks for Optimizing Transformer Efficiency

Figure 4 for Bag of Tricks for Optimizing Transformer Efficiency

Abstract:Improving Transformer efficiency has become increasingly attractive recently. A wide range of methods has been proposed, e.g., pruning, quantization, new architectures and etc. But these methods are either sophisticated in implementation or dependent on hardware. In this paper, we show that the efficiency of Transformer can be improved by combining some simple and hardware-agnostic methods, including tuning hyper-parameters, better design choices and training strategies. On the WMT news translation tasks, we improve the inference efficiency of a strong Transformer system by 3.80X on CPU and 2.52X on GPU. The code is publicly available at https://github.com/Lollipop321/mini-decoder-network.

* accepted by EMNLP (Findings) 2021

Via

Access Paper or Ask Questions

An Efficient Transformer Decoder with Compressed Sub-layers

Jan 03, 2021

Yanyang Li, Ye Lin, Tong Xiao, Jingbo Zhu

Figure 1 for An Efficient Transformer Decoder with Compressed Sub-layers

Figure 2 for An Efficient Transformer Decoder with Compressed Sub-layers

Figure 3 for An Efficient Transformer Decoder with Compressed Sub-layers

Figure 4 for An Efficient Transformer Decoder with Compressed Sub-layers

Abstract:The large attention-based encoder-decoder network (Transformer) has become prevailing recently due to its effectiveness. But the high computation complexity of its decoder raises the inefficiency issue. By examining the mathematic formulation of the decoder, we show that under some mild conditions, the architecture could be simplified by compressing its sub-layers, the basic building block of Transformer, and achieves a higher parallelism. We thereby propose Compressed Attention Network, whose decoder layer consists of only one sub-layer instead of three. Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42x faster with performance on par with a strong baseline. This strong baseline is already 2x faster than the widely used standard baseline without loss in performance.

* accepted by AAAI2021

Via

Access Paper or Ask Questions

A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary Induction

Nov 30, 2020

Yanyang Li, Yingfeng Luo, Ye Lin, Quan Du, Huizhen Wang, Shujian Huang, Tong Xiao, Jingbo Zhu

Figure 1 for A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary Induction

Figure 2 for A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary Induction

Figure 3 for A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary Induction

Figure 4 for A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary Induction

Abstract:Unsupervised Bilingual Dictionary Induction methods based on the initialization and the self-learning have achieved great success in similar language pairs, e.g., English-Spanish. But they still fail and have an accuracy of 0% in many distant language pairs, e.g., English-Japanese. In this work, we show that this failure results from the gap between the actual initialization performance and the minimum initialization performance for the self-learning to succeed. We propose Iterative Dimension Reduction to bridge this gap. Our experiments show that this simple method does not hamper the performance of similar language pairs and achieves an accuracy of 13.64~55.53% between English and four distant languages, i.e., Chinese, Japanese, Vietnamese and Thai.

* Accepted by COLING2020

Via

Access Paper or Ask Questions