Abstract:In dense retrieval, deep encoders provide embeddings for both inputs and targets, and the softmax function is used to parameterize a distribution over a large number of candidate targets (e.g., textual passages for information retrieval). Significant challenges arise in training such encoders in the increasingly prevalent scenario of (1) a large number of targets, (2) a computationally expensive target encoder model, and (3) cached target embeddings that are out-of-date due to ongoing training of target encoder parameters. This paper presents a simple and highly scalable response to these challenges: we train a small parametric corrector network that adjusts stale cached target embeddings, enabling an accurate softmax approximation and thereby the sampling of up-to-date, high-scoring "hard negatives." We theoretically investigate the generalization properties of our proposed target corrector, relating the complexity of the network, the staleness of cached representations, and the amount of training data. We present experimental results on large benchmark dense retrieval datasets as well as on QA with retrieval-augmented language models. Our approach matches state-of-the-art results even when no target embedding updates are made during training beyond an initial cache from the unsupervised pre-trained model, providing a 4-80x reduction in re-embedding computational cost.
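A minimal sketch of the corrector idea in PyTorch, assuming a residual two-layer MLP fit with a mean-squared-error loss to a small batch of freshly re-embedded targets; the `Corrector` name, architecture, and loss are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Corrector(nn.Module):
    """Small MLP mapping a stale cached target embedding towards the
    current target encoder's output (illustrative architecture)."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
    def forward(self, stale):
        return stale + self.net(stale)          # residual correction

dim, n_targets = 128, 10_000
corrector = Corrector(dim)
stale_cache = torch.randn(n_targets, dim)       # embeddings from an old checkpoint
opt = torch.optim.Adam(corrector.parameters(), lr=1e-3)

# Fit the corrector on a small batch of freshly re-embedded targets, then use the
# corrected cache to score every candidate and pick high-scoring hard negatives.
idx = torch.randint(n_targets, (256,))
fresh = stale_cache[idx] + 0.1 * torch.randn(256, dim)   # stand-in for re-encoding
opt.zero_grad()
F.mse_loss(corrector(stale_cache[idx]), fresh).backward()
opt.step()

query = torch.randn(1, dim)
scores = query @ corrector(stale_cache).T                # softmax logits over all targets
hard_negative_ids = scores.topk(32).indices
```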
Abstract:Score-based models, trained with denoising score matching, are remarkably effective at generating high-dimensional data. However, the high variance of their training objective hinders optimisation. We attempt to reduce it with a control variate, derived via a $k$-th order Taylor expansion of the training objective and of its gradient. We prove an equivalence between the two expansions, demonstrate empirically the effectiveness of our approach in a low-dimensional problem setting, and study its effect on larger problems.
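An illustrative numerical example of the underlying variance-reduction mechanism, assuming a generic scalar objective rather than the denoising score matching loss: a first-order Taylor control variate with known expectation is subtracted from the Monte Carlo estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: estimate E_eps[ f(x + sigma * eps) ] with eps ~ N(0, 1).
f = np.tanh
df = lambda t: 1.0 - np.tanh(t) ** 2
x, sigma, n = 0.7, 0.5, 2_000
eps = rng.standard_normal(n)

plain = f(x + sigma * eps)

# First-order Taylor control variate g(eps) = f(x) + sigma * f'(x) * eps,
# whose expectation under eps ~ N(0, 1) is exactly f(x).
cv = f(x) + sigma * df(x) * eps
corrected = plain - cv + f(x)                   # same mean, lower variance

print("plain estimator variance:  ", plain.var())
print("control-variate variance:  ", corrected.var())
```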
Abstract:Denoising diffusion models are a popular class of generative models providing state-of-the-art results in many domains. Noise is gradually added to the data using a diffusion process, transforming the data distribution into a Gaussian distribution. Samples from the generative model are then obtained by simulating an approximation of the time-reversal of this diffusion, initialized from Gaussian samples. In practice, the intractable score terms appearing in the time-reversed process are approximated using score matching techniques. Here we explore a similar idea to sample approximately from unnormalized probability density functions and estimate their normalizing constants. We consider a process where the target density diffuses towards a Gaussian. Denoising Diffusion Samplers (DDS) are obtained by approximating the corresponding time-reversal. While score matching is not applicable in this context, we can leverage many of the ideas introduced in generative modeling for Monte Carlo sampling. Existing theoretical results from denoising diffusion models also provide theoretical guarantees for DDS. We discuss the connections between DDS, optimal control and Schr\"odinger bridges and finally demonstrate DDS experimentally on a variety of challenging sampling tasks.
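A toy 1D illustration of the DDS recipe, assuming a two-component Gaussian-mixture target so that the score of the diffused target is available in closed form; in DDS proper only an unnormalized density is available and this score must be approximated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D target: a two-component Gaussian mixture, so the score of the diffused
# target is exact here (in DDS it would have to be approximated).
means, stds, weights = np.array([-2.0, 2.0]), np.array([0.5, 0.5]), np.array([0.5, 0.5])
beta, T, n_steps, n_samples = 1.0, 8.0, 500, 5_000

def diffused_score(x, t):
    """Score of the target pushed through the OU forward process for time t."""
    a = np.exp(-0.5 * beta * t)
    m, v = means * a, stds**2 * a**2 + (1.0 - a**2)
    logp = -0.5 * (x[:, None] - m) ** 2 / v - 0.5 * np.log(2 * np.pi * v) + np.log(weights)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)                     # component responsibilities
    return (r * (m - x[:, None]) / v).sum(axis=1)

# Euler--Maruyama simulation of the time-reversal, initialized from N(0, 1).
dt = T / n_steps
x = rng.standard_normal(n_samples)
for i in range(n_steps):
    t = T - i * dt
    drift = -0.5 * beta * x - beta * diffused_score(x, t)
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n_samples)

print("recovered mode means:", x[x < 0].mean(), x[x > 0].mean())   # approx -2 and 2
```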
Abstract:Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly, we find that these samplers lead to notable improvements in compositional generation across a wide set of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation.
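A minimal sketch of the sampler-side fix on a toy 1D problem: two Gaussian "concepts" are composed by summing their (closed-form) smoothed scores, and a few unadjusted Langevin steps are run at each noise level; the step sizes, annealing schedule, and closed-form scores are illustrative assumptions rather than learned models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy 1D "concepts", each Gaussian; composition means sampling from the
# product of their densities. The noise-smoothed scores are closed form here.
mu1, v1, mu2, v2 = -1.0, 1.0, 2.0, 0.5

def smoothed_score(x, mean, var, sigma):
    return (mean - x) / (var + sigma**2)        # score of N(mean, var) convolved with N(0, sigma^2)

# Annealed MCMC: at each noise level, run unadjusted Langevin steps on the sum
# of the smoothed scores, so the chain tracks the product distribution.
sigmas = np.geomspace(3.0, 0.05, 40)
x = sigmas[0] * rng.standard_normal(5_000)
for sigma in sigmas:
    step = 0.5 * sigma**2
    for _ in range(30):
        score = smoothed_score(x, mu1, v1, sigma) + smoothed_score(x, mu2, v2, sigma)
        x = x + step * score + np.sqrt(2 * step) * rng.standard_normal(x.shape)

# The exact product of the two Gaussians, for comparison.
prec = 1 / v1 + 1 / v2
print("empirical mean/var:", x.mean(), x.var(), " exact:", (mu1 / v1 + mu2 / v2) / prec, 1 / prec)
```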
Abstract:Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.
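A hedged sketch of a CDCD-style training step, assuming a toy MLP denoiser and a simple linear noise scale: token embeddings are noised in continuous time and the model is trained with cross-entropy to recover the original tokens. The architecture and schedule are assumptions for illustration, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, seq_len, batch = 1000, 64, 16, 8
embed = nn.Embedding(vocab, dim)
denoiser = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, vocab))
opt = torch.optim.Adam(list(embed.parameters()) + list(denoiser.parameters()), lr=1e-3)

# One training step: noise the token embeddings at a random continuous time and
# train the model with cross-entropy to predict the original tokens.
tokens = torch.randint(vocab, (batch, seq_len))
t = torch.rand(batch, 1, 1)                               # continuous time in [0, 1]
sigma = 5.0 * t                                           # assumed noise schedule
noisy = embed(tokens) + sigma * torch.randn(batch, seq_len, dim)

logits = denoiser(torch.cat([noisy, t.expand(batch, seq_len, 1)], dim=-1))
loss = F.cross_entropy(logits.reshape(-1, vocab), tokens.reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()
```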
Abstract:Can continuous diffusion models bring the same performance breakthrough to natural language that they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens into a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion, a continuous diffusion mechanism that operates on token embeddings and allows us to learn flexible and scalable diffusion models for both conditional and unconditional text generation. Through qualitative and quantitative evaluation, we show that our text diffusion models generate samples comparable to those produced by standard autoregressive language models, while in theory being more efficient on accelerator hardware at inference time. Our work paves the way for scaling up diffusion models for text, similarly to autoregressive models, and for improving performance with recent refinements to continuous diffusion.
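A minimal sketch of the self-conditioning mechanism on token embeddings, assuming a toy MLP denoiser trained by regressing the clean embedding; the 50% self-conditioning rate, architecture, and noise schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, seq_len, batch = 1000, 64, 16, 8
embed = nn.Embedding(vocab, dim)
denoiser = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))

tokens = torch.randint(vocab, (batch, seq_len))
clean = embed(tokens)
t = torch.rand(batch, 1, 1)
noisy = clean + t * torch.randn_like(clean)

# Self-conditioning: half the time, run a first pass, detach its estimate of the
# clean embedding, and feed it back as an extra input to the second pass.
self_cond = torch.zeros_like(clean)
if torch.rand(()) < 0.5:
    with torch.no_grad():
        self_cond = denoiser(torch.cat([noisy, self_cond, t.expand(batch, seq_len, 1)], -1))

pred = denoiser(torch.cat([noisy, self_cond, t.expand(batch, seq_len, 1)], -1))
loss = F.mse_loss(pred, clean)                  # regress the clean embedding
loss.backward()
```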
Abstract:A fundamental ability of an intelligent web-based agent is seeking out and acquiring new information. Internet search engines reliably find the correct vicinity, but the top results may be a few links away from the desired target. A complementary approach is navigation via hyperlinks, employing a policy that comprehends local content and selects a link that moves it closer to the target. In this paper, we show that behavioral cloning of randomly sampled trajectories is sufficient to learn an effective link selection policy. We demonstrate the approach on a graph version of Wikipedia with 38M nodes and 387M edges. The model is able to efficiently navigate between nodes 5 and 20 steps apart 96% and 92% of the time, respectively. We then use the resulting embeddings and policy in downstream fact verification and question answering tasks where, in combination with basic TF-IDF search and ranking methods, they achieve results competitive with state-of-the-art methods.
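A toy sketch of the behavioral-cloning setup, assuming a small random graph with random node features in place of Wikipedia pages and learned text encoders: random walks supply (start, target, first-link) supervision for a link-scoring network.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

n_nodes, dim = 200, 32
feats = torch.randn(n_nodes, dim)                                  # stand-in for page embeddings
adj = [random.sample(range(n_nodes), 5) for _ in range(n_nodes)]   # 5 out-links per node

scorer = nn.Sequential(nn.Linear(3 * dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

def score_links(cur, target):
    cand = torch.tensor(adj[cur])
    x = torch.cat([feats[cand],
                   feats[cur].expand(len(cand), dim),
                   feats[target].expand(len(cand), dim)], dim=-1)
    return scorer(x).squeeze(-1), cand

# Behavioral cloning: a random walk's endpoint plays the role of the target, and
# the first hop of the walk is the "expert" link choice to imitate.
for _ in range(500):
    walk = [random.randrange(n_nodes)]
    for _ in range(5):
        walk.append(random.choice(adj[walk[-1]]))
    logits, cand = score_links(walk[0], walk[-1])
    label = (cand == walk[1]).float().argmax()
    loss = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
    opt.zero_grad(); loss.backward(); opt.step()
```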
Abstract:More than twenty years after its introduction, Annealed Importance Sampling (AIS) remains one of the most effective methods for marginal likelihood estimation. It relies on a sequence of distributions interpolating between a tractable initial distribution and the target distribution of interest, from which we sample approximately using a non-homogeneous Markov chain. To obtain an importance sampling estimate of the marginal likelihood, AIS introduces an extended target distribution to reweight the Markov chain proposal. While much effort has been devoted to improving the proposal distribution used by AIS, by changing the intermediate distributions and corresponding Markov kernels, an underappreciated issue is that AIS uses a convenient but suboptimal extended target distribution; this can hinder its performance. Here we leverage recent progress in score-based generative modeling (SGM) to approximate the optimal extended target distribution for AIS proposals corresponding to the discretization of Langevin and Hamiltonian dynamics. We demonstrate these novel, differentiable AIS procedures on a number of synthetic benchmark distributions and variational auto-encoders.
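For reference, a compact implementation of plain AIS with a geometric path and random-walk Metropolis kernels, i.e. the convenient extended target the abstract describes as suboptimal; the toy Gaussian-mixture target and kernel settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):   # unnormalized target: Gaussian mixture whose true log Z is 0
    return np.logaddexp(-0.5 * (x - 3) ** 2, -0.5 * (x + 3) ** 2) - 0.5 * np.log(8 * np.pi)

def log_init(x):     # tractable initial distribution N(0, 1)
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_gamma(x, b): # geometric path between the two
    return (1 - b) * log_init(x) + b * log_target(x)

betas, n = np.linspace(0, 1, 200), 5_000
x, logw = rng.standard_normal(n), np.zeros(n)

for b0, b1 in zip(betas[:-1], betas[1:]):
    logw += log_gamma(x, b1) - log_gamma(x, b0)              # incremental importance weight
    prop = x + 0.5 * rng.standard_normal(n)                  # random-walk MH at level b1
    accept = np.log(rng.random(n)) < log_gamma(prop, b1) - log_gamma(x, b1)
    x = np.where(accept, prop, x)

log_Z = np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()
print("estimated log Z:", log_Z, "(true value 0.0)")
```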
Abstract:Multi-attribute classification generalizes standard classification, presenting new challenges for making accurate predictions and quantifying uncertainty. We build upon recent work and show that architectures for multi-attribute prediction can be reinterpreted as energy-based models (EBMs). While existing EBM approaches achieve strong discriminative performance, they are unable to generate samples conditioned on novel attribute combinations. We propose a simple extension which expands the capabilities of EBMs to generating accurate conditional samples. Our approach, combined with newly developed techniques in energy-based model training, allows us to directly maximize the likelihood of data and labels under the unnormalized joint distribution. We evaluate our proposed approach on high-dimensional image data with high-dimensional binary attribute labels. We find our models are capable of both accurate, calibrated predictions and high-quality conditional synthesis of novel attribute combinations.
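A hedged sketch of the joint-energy view, assuming a tiny MLP energy over an input and its binary attribute vector: per-attribute predictions come from energy differences, and conditional synthesis runs Langevin dynamics on the input with a novel attribute combination held fixed. All sizes and step settings are illustrative.

```python
import torch
import torch.nn as nn

d_x, n_attr = 64, 5
net = nn.Sequential(nn.Linear(d_x + n_attr, 128), nn.ReLU(), nn.Linear(128, 1))
energy = lambda x, a: net(torch.cat([x, a], dim=-1)).squeeze(-1)   # joint energy E(x, a)

# Prediction: the logit for attribute j being 1 is the energy drop from switching it on,
# with the other attributes held at a reference value (a simplification for illustration).
def predict(x):
    a = torch.zeros(x.shape[0], n_attr)
    logits = []
    for j in range(n_attr):
        a0, a1 = a.clone(), a.clone()
        a0[:, j], a1[:, j] = 0.0, 1.0
        logits.append(energy(x, a0) - energy(x, a1))
    return torch.stack(logits, dim=-1)

attr_logits = predict(torch.randn(4, d_x))                  # shape (4, n_attr)

# Conditional synthesis: Langevin dynamics on x with a novel attribute combination fixed.
a_new = torch.tensor([[1.0, 0.0, 1.0, 1.0, 0.0]])
x_gen = torch.randn(1, d_x, requires_grad=True)
for _ in range(100):
    grad = torch.autograd.grad(energy(x_gen, a_new).sum(), x_gen)[0]
    x_gen = (x_gen - 0.01 * grad + 0.05 * torch.randn_like(x_gen)).detach().requires_grad_()
```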
Abstract:We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete data. This approach outperforms variational auto-encoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
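A minimal sketch of a gradient-informed Metropolis-Hastings step for binary (±1) variables on a small Ising-like model; the random couplings, the temperature of the proposal softmax, and the step count are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

D = 64
J = torch.randn(D, D) * 0.05
J = (J + J.T) / 2
J.fill_diagonal_(0)
theta = 0.1 * torch.randn(D)

def log_prob(x):                                 # unnormalized log-probability
    return 0.5 * x @ J @ x + theta @ x

def flip_gains(x):                               # first-order estimate of log_prob change per flip
    xg = x.clone().requires_grad_()
    grad = torch.autograd.grad(log_prob(xg), xg)[0]
    return -2 * x * grad

x = torch.randint(0, 2, (D,)).float() * 2 - 1    # spins in {-1, +1}
for _ in range(2_000):
    q_fwd = torch.softmax(flip_gains(x) / 2, dim=0)
    i = torch.multinomial(q_fwd, 1)              # propose flipping the i-th spin
    x_new = x.clone()
    x_new[i] = -x_new[i]
    q_rev = torch.softmax(flip_gains(x_new) / 2, dim=0)
    log_alpha = log_prob(x_new) - log_prob(x) + torch.log(q_rev[i]) - torch.log(q_fwd[i])
    if torch.log(torch.rand(1)) < log_alpha:     # Metropolis-Hastings correction
        x = x_new
```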