Vinay Rao

Distributed Multigrid Neural Solvers on Megavoxel Domains

Apr 29, 2021

Aditya Balu, Sergio Botelho, Biswajit Khara, Vinay Rao, Chinmay Hegde, Soumik Sarkar, Santi Adavani, Adarsh Krishnamurthy, Baskar Ganapathysubramanian

Abstract: We consider the distributed training of large-scale neural networks that serve as PDE solvers producing full field outputs. We specifically consider neural solvers for the generalized 3D Poisson equation over megavoxel domains. A scalable framework is presented that integrates two distinct advances. First, we accelerate training a large model via a method analogous to the multigrid technique used in numerical linear algebra. Here, the network is trained using a hierarchy of inputs of increasing resolution in sequence, analogous to the 'V', 'W', 'F', and 'Half-V' cycles used in multigrid approaches. In conjunction with the multigrid approach, we implement a distributed deep learning framework which significantly reduces the time to solve. We show the scalability of this approach on both GPU (Azure VMs on Cloud) and CPU clusters (PSC Bridges2). This approach is deployed to train a generalized 3D Poisson solver that scales well to predict output full-field solutions up to the resolution of 512x512x512 for a high dimensional family of inputs.
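
The multigrid-style training order described in the abstract above can be pictured as a schedule of input resolutions. The following is a minimal sketch only: the function names, the halving between levels, and the exact shape of each cycle are assumptions borrowed from classical multigrid terminology, not the authors' implementation, and the abstract does not specify them.

```python
from typing import List


def resolution_levels(finest: int, num_levels: int) -> List[int]:
    """Return resolutions from coarsest to finest, halving at each level.

    Hypothetical helper: the factor of 2 between levels is an assumption.
    """
    return [finest // (2 ** k) for k in reversed(range(num_levels))]


def half_v_cycle(levels: List[int]) -> List[int]:
    """Coarse-to-fine pass only (assumed meaning of 'Half-V')."""
    return list(levels)


def v_cycle(levels: List[int]) -> List[int]:
    """Fine-to-coarse pass followed by a coarse-to-fine pass (assumed 'V')."""
    down = list(reversed(levels))      # finest ... coarsest
    up = list(levels[1:])              # second-coarsest ... finest
    return down + up


def train_with_schedule(schedule: List[int]) -> List[str]:
    """Placeholder loop: a real trainer would resample data and train here."""
    return [f"train at {r}x{r}x{r}" for r in schedule]
```

With `resolution_levels(512, 4)` the levels are 64, 128, 256, and 512, so a Half-V schedule visits each resolution once while a V schedule starts and ends at 512. The same network weights would be carried from one stage to the next, which is what lets the cheap coarse stages do most of the early learning.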

WebRED: Effective Pretraining And Finetuning For Relation Extraction On The Web

Feb 18, 2021

Robert Ormandi, Mohammad Saleh, Erin Winter, Vinay Rao

Abstract: Relation extraction is used to populate knowledge bases that are important to many applications. Prior datasets used to train relation extraction models either suffer from noisy labels due to distant supervision, are limited to certain domains, or are too small to train high-capacity models. This constrains downstream applications of relation extraction. We therefore introduce WebRED (Web Relation Extraction Dataset), a strongly-supervised human annotated dataset for extracting relationships from a variety of text found on the World Wide Web, consisting of ~110K examples. We also describe the methods we used to collect ~200M examples as pre-training data for this task. We show that combining pre-training on a large weakly supervised dataset with fine-tuning on a small strongly-supervised dataset leads to better relation extraction performance. We provide baselines for this new dataset and present a case for the importance of human annotation in improving the performance of relation extraction from text found on the web.

Is Batch Norm unique? An empirical investigation and prescription to emulate the best properties of common normalizers without batch dependence

Oct 21, 2020

Vinay Rao, Jascha Sohl-Dickstein

Abstract: We perform an extensive empirical study of the statistical properties of Batch Norm and other common normalizers. This includes an examination of the correlation between representations of minibatches, gradient norms, and Hessian spectra both at initialization and over the course of training. Through this analysis, we identify several statistical properties which appear linked to Batch Norm's superior performance. We propose two simple normalizers, PreLayerNorm and RegNorm, which better match these desirable properties without involving operations along the batch dimension. We show that PreLayerNorm and RegNorm achieve much of the performance of Batch Norm without requiring batch dependence, that they reliably outperform LayerNorm, and that they can be applied in situations where Batch Norm is ineffective.
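
To make "batch dependence" in the abstract above concrete, here is a small sketch contrasting plain Batch Norm and Layer Norm on a list-of-lists batch. It omits learned scale and shift parameters and running statistics, and it does not implement PreLayerNorm or RegNorm, whose definitions the abstract does not give; the helper names are illustrative assumptions.

```python
from statistics import fmean, pvariance
from typing import List

EPS = 1e-5  # small constant for numerical stability (assumed value)


def _normalize(values: List[float]) -> List[float]:
    """Shift to zero mean and scale to (approximately) unit variance."""
    mu = fmean(values)
    var = pvariance(values, mu)
    return [(v - mu) / (var + EPS) ** 0.5 for v in values]


def layer_norm(batch: List[List[float]]) -> List[List[float]]:
    """Normalize each sample across its own features (no batch dependence)."""
    return [_normalize(row) for row in batch]


def batch_norm(batch: List[List[float]]) -> List[List[float]]:
    """Normalize each feature across the samples in the batch."""
    columns = [_normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]
```

Placing the same sample in two different batches shows the difference: `layer_norm` returns an identical row for that sample both times, while the `batch_norm` output for it changes with the other samples in the batch.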

Assessing The Factual Accuracy of Generated Text

May 30, 2019

Ben Goodrich, Vinay Rao, Mohammad Saleh, Peter J Liu

Abstract: We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.

* The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19), August 4–8, 2019, Anchorage, AK, USA

A Mean Field Theory of Batch Normalization

Mar 05, 2019

Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range. Our theory leverages Laplace, Fourier, and Gegenbauer transforms and we derive new identities that may be of independent interest.

* To appear in ICLR 2019

Reducing Bias in Production Speech Models

May 11, 2017

Eric Battenberg, Rewon Child, Adam Coates, Christopher Fougner, Yashesh Gaur, Jiaji Huang, Heewoo Jun, Ajay Kannan, Markus Kliegl, Atul Kumar (+6 more)

Abstract: Replacing hand-engineered pipelines with end-to-end deep learning systems has enabled strong results in applications like speech and object recognition. However, the causality and latency constraints of production systems put end-to-end speech models back into the underfitting regime and expose biases in the model that we show cannot be overcome by "scaling up", i.e., training bigger models on more data. In this work we systematically identify and address sources of bias, reducing error rates by up to 20% while remaining practical for deployment. We achieve this by utilizing improved neural architectures for streaming inference, solving optimization issues, and employing strategies that increase audio and label modelling versatility.

Active Learning for Speech Recognition: the Power of Gradients

Dec 10, 2016

Jiaji Huang, Rewon Child, Vinay Rao, Hairong Liu, Sanjeev Satheesh, Adam Coates

Abstract: In training speech recognition systems, labeling audio clips can be expensive, and not all data is equally valuable. Active learning aims to label only the most informative samples to reduce cost. For speech recognition, confidence scores and other likelihood-based active learning methods have been shown to be effective. Gradient-based active learning methods, however, are still not well-understood. This work investigates the Expected Gradient Length (EGL) approach in active learning for end-to-end speech recognition. We justify EGL from a variance reduction perspective, and observe that EGL's measure of informativeness picks novel samples uncorrelated with confidence scores. Experimentally, we show that EGL can reduce word errors by 11%, or alternatively, reduce the number of samples to label by 50%, when compared to random sampling.

* Published as a workshop paper at NIPS 2016
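
The Expected Gradient Length idea from the abstract above scores an unlabeled sample by the norm of the loss gradient it would produce, averaged over the labels the current model considers plausible. The sketch below illustrates this on binary logistic regression as a stand-in; the paper applies EGL to end-to-end speech recognizers, and the function names and selection helper here are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def expected_gradient_length(weights: Sequence[float], x: Sequence[float]) -> float:
    """EGL of one unlabeled sample under a logistic regression model.

    For log-loss, the gradient w.r.t. the weights given label y is (p - y) * x,
    so its norm is |p - y| * ||x||. EGL averages that norm over y ~ model.
    """
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    norm_if_positive = abs(p - 1.0) * x_norm   # gradient norm if y = 1
    norm_if_negative = abs(p - 0.0) * x_norm   # gradient norm if y = 0
    return p * norm_if_positive + (1.0 - p) * norm_if_negative


def select_for_labeling(weights: Sequence[float],
                        pool: Sequence[Sequence[float]],
                        k: int) -> List[int]:
    """Return indices of the k pool samples with the largest EGL."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: expected_gradient_length(weights, pool[i]),
                    reverse=True)
    return ranked[:k]
```

In this toy setting the score reduces to 2p(1 - p)||x||, so it favors samples the model is unsure about and that have large inputs. For a sequence model the expectation over labelings cannot be enumerated this way and would have to be approximated.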
