Nicolas Vasilache

MLIR: A Compiler Infrastructure for the End of Moore's Law

Mar 01, 2020

Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, Oleksandr Zinenko

Abstract: This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain-specific compilers, and aid in connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and also across application domains, hardware targets and execution environments. The contributions of this work include (1) a discussion of MLIR as a research artifact, built for extension and evolution, identifying the challenges and opportunities posed by this novel design point in design, semantics, optimization specification, system, and engineering; and (2) an evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers, describing diverse use cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics.

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Jun 29, 2018

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, Albert Cohen

Abstract: Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks, such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation, and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as cuDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, and distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. This is frequently required when new operators are invented by researchers: such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, even if there is an existing runtime call these frameworks can use, it often doesn't offer optimal performance for a user's particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the size and shape of data. Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated by an autotuner. [Abstract cutoff]

Diagonal Rescaling For Neural Networks

May 25, 2017

Jean Lafond, Nicolas Vasilache, Léon Bottou

Abstract: We define a second-order neural network stochastic gradient training algorithm whose block-diagonal structure effectively amounts to normalizing the unit activations. Investigating why this algorithm lacks robustness then reveals two interesting insights. The first insight suggests a new way to scale the stepsizes, clarifying popular algorithms such as RMSProp as well as old neural network tricks such as fan-in stepsize scaling. The second insight stresses the practical importance of dealing with fast changes of the curvature of the cost.
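
For readers unfamiliar with the stepsize scaling that the abstract refers to, the following is a minimal sketch of the standard RMSProp update, in which each parameter's step is divided by a running root-mean-square of its own gradient. It is not the algorithm proposed in the paper; the function name `rmsprop_step` and the default values for `lr`, `decay` and `eps` are illustrative assumptions.

```python
import math


def rmsprop_step(params, grads, mean_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One standard RMSProp update over flat lists of floats.

    Illustrative only: this is textbook RMSProp, not the paper's method.
    Returns the updated parameters and the updated running mean of
    squared gradients.
    """
    # Exponential moving average of the squared gradient, per parameter.
    new_mean_sq = [decay * m + (1.0 - decay) * g * g
                   for m, g in zip(mean_sq, grads)]
    # Diagonal rescaling: each step is divided by that parameter's RMS gradient.
    new_params = [p - lr * g / (math.sqrt(m) + eps)
                  for p, g, m in zip(params, grads, new_mean_sq)]
    return new_params, new_mean_sq


# Two parameters whose gradients differ by a factor of 100.
params, state = rmsprop_step([1.0, 1.0], [10.0, 0.1], [0.0, 0.0])
steps = [1.0 - p for p in params]
```

In this usage example the two entries of `steps` come out nearly equal even though the gradients differ by two orders of magnitude, which is the per-coordinate normalization effect that makes RMSProp a form of diagonal rescaling.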

Training Language Models Using Target-Propagation

Feb 15, 2017

Sam Wiseman, Sumit Chopra, Marc'Aurelio Ranzato, Arthur Szlam, Ruoyu Sun, Soumith Chintala, Nicolas Vasilache

Abstract: While Truncated Back-Propagation through Time (BPTT) is the most popular approach to training Recurrent Neural Networks (RNNs), it suffers from being inherently sequential (making parallelization difficult) and from truncating gradient flow between distant time-steps. We investigate whether Target Propagation (TPROP) style approaches can address these shortcomings. Unfortunately, extensive experiments suggest that TPROP generally underperforms BPTT, and we end with an analysis of this phenomenon, and suggestions for future work.

Learning Visual Features from Large Weakly Supervised Data

Nov 06, 2015

Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache

Abstract: Convolutional networks trained on large supervised datasets produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled data sets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and captions, and show that these networks produce features that perform well in a range of vision problems. We also show that the networks appropriately capture word similarity, and learn correspondences between different languages.

Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

Apr 10, 2015

Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, Yann LeCun

Abstract: We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution implementations: one based on NVIDIA's cuFFT library, and another based on a Facebook-authored FFT implementation, fbfft, that provides significant speedups over cuFFT (over 1.5x) for whole CNNs. Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5x for some synthetic kernel configurations). We discuss different performance regimes of convolutions, comparing areas where straightforward time-domain convolutions outperform Fourier frequency-domain convolutions. Details on algorithmic applications of NVIDIA GPU hardware specifics in the implementation of fbfft are also provided.

* Camera ready for ICLR2015
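
The time-domain versus frequency-domain comparison in the abstract above can be illustrated with a small 1-D sketch. It computes the same full linear convolution twice: directly, and by multiplying zero-padded FFTs pointwise and transforming back. This is a pure-Python illustration of the general technique only, not the fbfft or cuFFT implementation; the function names `fft`, `direct_convolve` and `fft_convolve` are hypothetical, and the radix-2 FFT assumes power-of-two lengths.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two.

    With inverse=True the result is the unnormalized inverse transform.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def direct_convolve(signal, kernel):
    """Time-domain full linear convolution, O(len(signal) * len(kernel))."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i + j] += s * w
    return out


def fft_convolve(signal, kernel):
    """Frequency-domain full linear convolution via zero-padded FFTs."""
    out_len = len(signal) + len(kernel) - 1
    size = 1
    while size < out_len:  # pad to a power of two that fits the full output
        size *= 2
    fs = fft(list(signal) + [0.0] * (size - len(signal)))
    fk = fft(list(kernel) + [0.0] * (size - len(kernel)))
    product = [a * b for a, b in zip(fs, fk)]
    # The inverse transform is unnormalized, so divide by the padded size.
    return [v.real / size for v in fft(product, inverse=True)][:out_len]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0]
time_domain = direct_convolve(signal, kernel)
freq_domain = fft_convolve(signal, kernel)
```

Both paths agree up to floating-point rounding. The direct method costs time proportional to the product of the two lengths, while the FFT path costs roughly `size * log(size)` regardless of kernel length, which is consistent with the abstract's point that small kernels can favor the time domain and larger ones the frequency domain.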
