Jörg Lücke

Sublinear Variational Optimization of Gaussian Mixture Models with Millions to Billions of Parameters

Jan 21, 2025

Sebastian Salwig, Till Kahlke, Florian Hirschberger, Dennis Forster, Jörg Lücke

Abstract: Gaussian Mixture Models (GMMs) rank among the most frequently used machine learning models. However, training large, general GMMs becomes computationally prohibitive for datasets with many data points $N$ of high dimensionality $D$. For GMMs with arbitrary covariances, we here derive a highly efficient variational approximation, which is integrated with mixtures of factor analyzers (MFAs). For GMMs with $C$ components, our proposed algorithm significantly reduces runtime complexity per iteration from $\mathcal{O}(NCD^2)$ to a complexity scaling linearly with $D$ and remaining constant w.r.t. $C$. Numerical validation of this theoretical complexity reduction then shows the following: the distance evaluations required for the entire GMM optimization process scale sublinearly with $NC$. On large-scale benchmarks, this sublinearity results in speed-ups of an order of magnitude compared to the state of the art. As a proof of concept, we train GMMs with over 10 billion parameters on about 100 million images, and observe training times of approximately nine hours on a single state-of-the-art CPU.

* 22 pages, 6 figures (and 17 pages, 3 figures in Appendix)
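
The abstract attributes the linear scaling in $D$ to the integration with mixtures of factor analyzers. As a minimal sketch of why that parameterization helps (this is the standard low-rank-plus-diagonal identity, not the paper's variational algorithm), the log-density of one component with covariance $\Lambda\Lambda^T + \mathrm{diag}(\psi)$ can be evaluated without ever forming a $D \times D$ matrix. The names `mfa_component_logpdf`, `Lam`, and `psi`, and the number of factors `H`, are assumptions introduced for illustration.

```python
import numpy as np

def mfa_component_logpdf(x, mu, Lam, psi):
    """log N(x; mu, Lam @ Lam.T + diag(psi)), computed in O(D H^2).

    Uses the Woodbury identity for the quadratic form and the matrix
    determinant lemma for the log-determinant.
    """
    D, H = Lam.shape
    r = x - mu
    r_scaled = r / psi                    # diag(psi)^{-1} r
    Lam_scaled = Lam / psi[:, None]       # diag(psi)^{-1} Lam
    M = np.eye(H) + Lam.T @ Lam_scaled    # small H x H matrix
    b = Lam.T @ r_scaled
    maha = r @ r_scaled - b @ np.linalg.solve(M, b)
    logdet = np.log(psi).sum() + np.linalg.slogdet(M)[1]
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def dense_logpdf(x, mu, Sigma):
    """Reference implementation with an explicit D x D covariance."""
    D = x.shape[0]
    r = x - mu
    maha = r @ np.linalg.solve(Sigma, r)
    logdet = np.linalg.slogdet(Sigma)[1]
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

rng = np.random.default_rng(0)
D, H = 50, 3
x = rng.normal(size=D)
mu = rng.normal(size=D)
Lam = rng.normal(size=(D, H))
psi = rng.uniform(0.5, 2.0, size=D)

fast = mfa_component_logpdf(x, mu, Lam, psi)
slow = dense_logpdf(x, mu, Lam @ Lam.T + np.diag(psi))
```

For fixed `H`, the cost of `mfa_component_logpdf` grows linearly with `D`, whereas the dense reference needs at least $\mathcal{O}(D^2)$ memory and more time.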

Learning Sparse Codes with Entropy-Based ELBOs

Nov 03, 2023

Dmytro Velychko, Simon Damm, Asja Fischer, Jörg Lücke

Abstract: Standard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions. We here derive a solely entropy-based learning objective for the parameters of standard sparse coding. The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) unlike with previous non-trivial approximations, the novel objective is fully analytical; and (C) the objective allows for a novel principled form of annealing. The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors. The conditions under which the ELBO becomes equal to entropies are then shown to have analytical solutions, which leads to the fully analytical objective. Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs. We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations. Furthermore, we numerically investigate entropy-based annealing, which results in improved learning. Our main contributions are theoretical, however, and they are twofold: (1) for non-trivial posterior approximations, we provide the (to the knowledge of the authors) first analytical ELBO objective for standard probabilistic sparse coding; and (2) we provide the first demonstration of how a recently shown convergence of the ELBO to entropy sums can be used for learning.

On the Convergence of the ELBO to Entropy Sums

Sep 07, 2022

Jörg Lücke

Abstract: The variational lower bound (a.k.a. ELBO or free energy) is the central objective for many learning algorithms including algorithms for deep unsupervised learning. Learning algorithms change model parameters such that the variational lower bound increases, until the parameters are close to a stationary point of the learning dynamics. In this purely theoretical contribution, we show that (for a very large class of generative models) the variational lower bound is at all stationary points of learning equal to a sum of entropies. For models with one set of latents and one set of observed variables, the sum consists of three entropies: (A) the (average) entropy of the variational distributions, (B) the negative entropy of the model's prior distribution, and (C) the (expected) negative entropy of the observable distributions. The obtained result applies under realistic conditions including: finite numbers of data points, at any stationary point (including saddle points) and for any family of (well-behaved) variational distributions. The class of generative models for which we show the equality to entropy sums contains many (and presumably most) standard generative models (including deep models). As concrete examples we discuss probabilistic PCA and Sigmoid Belief Networks. The prerequisites we use to show equality to entropy sums are relatively mild. Concretely, the distributions of a given generative model have to be of the exponential family (with constant base measure), and a model has to satisfy a parameterization criterion (which is usually fulfilled). Proving the equality of the ELBO to entropy sums at stationary points (under the stated conditions) is the main contribution of this work.

* 22 pages
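
The abstract names probabilistic PCA as a concrete example, so the three-entropies statement can be checked numerically in that case. The sketch below assumes the well-known closed-form maximum-likelihood solution of probabilistic PCA (a stationary point) and uses the exact posterior as the variational distribution, so that the ELBO coincides with the average log-likelihood. All function and variable names are hypothetical and chosen for illustration.

```python
import numpy as np

def ppca_ml(X, H):
    """Closed-form ML solution of probabilistic PCA with H latents."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]                 # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = eigvals[H:].mean()                # mean of discarded eigenvalues
    W = eigvecs[:, :H] * np.sqrt(eigvals[:H] - sigma2)
    return W, sigma2, S

def avg_log_likelihood(W, sigma2, S):
    """Average log-likelihood per data point; equals the ELBO for exact posteriors."""
    D = S.shape[0]
    C = W @ W.T + sigma2 * np.eye(D)
    logdet = np.linalg.slogdet(C)[1]
    return -0.5 * (D * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))

def entropy_sum(W, sigma2):
    """H[q] - H[p(z)] - H[p(x|z)] for Gaussian posterior, prior, and noise model."""
    D, H = W.shape
    M = W.T @ W + sigma2 * np.eye(H)
    post_cov = sigma2 * np.linalg.inv(M)       # same for every data point
    h_q = 0.5 * (H * np.log(2 * np.pi * np.e) + np.linalg.slogdet(post_cov)[1])
    h_prior = 0.5 * H * np.log(2 * np.pi * np.e)
    h_obs = 0.5 * D * np.log(2 * np.pi * np.e * sigma2)
    return h_q - h_prior - h_obs

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
W, sigma2, S = ppca_ml(X, H=2)
ll = avg_log_likelihood(W, sigma2, S)
es = entropy_sum(W, sigma2)
```

Away from a stationary point (for example after perturbing `W`), the two quantities generally differ, which is consistent with the abstract's restriction of the equality to stationary points.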

Evolutionary Variational Optimization of Generative Models

Dec 22, 2020

Jakob Drefs, Enrico Guiraud, Jörg Lücke

Abstract: We combine two popular optimization approaches to derive learning algorithms for generative models: variational optimization and evolutionary algorithms. The combination is realized for generative models with discrete latents by using truncated posteriors as the family of variational distributions. The variational parameters of truncated posteriors are sets of latent states. By interpreting these states as genomes of individuals and by using the variational lower bound to define a fitness, we can apply evolutionary algorithms to realize the variational loop. The variational distributions used are very flexible and we show that evolutionary algorithms can effectively and efficiently optimize the variational bound. Furthermore, the variational loop is generally applicable ("black box") with no analytical derivations required. To show general applicability, we apply the approach to three generative models (we use noisy-OR Bayes Nets, Binary Sparse Coding, and Spike-and-Slab Sparse Coding). To demonstrate effectiveness and efficiency of the novel variational approach, we use the standard competitive benchmarks of image denoising and inpainting. The benchmarks allow quantitative comparisons to a wide range of methods including probabilistic approaches, deep deterministic and generative networks, and non-local image processing methods. In the category of "zero-shot" learning (when only the corrupted image is used for training), we observed the evolutionary variational algorithm to significantly improve the state of the art in many benchmark settings. For one well-known inpainting benchmark, we also observed state-of-the-art performance across all categories of algorithms although we only train on the corrupted image. In general, our investigations highlight the importance of research on optimization methods for generative models to achieve performance improvements.
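
The idea of treating latent states as genomes and the variational bound as fitness can be sketched for a single data point under a Binary Sparse Coding model (Bernoulli prior, linear-Gaussian observables). This is a simplified illustration with random bit-flip mutation and selection by log-joint, not the authors' algorithm; the function names and all parameter values are assumptions.

```python
import numpy as np

def log_joint(states, x, W, pi, sigma2):
    """log p(s, x) for each binary state s (rows of `states`)."""
    D = x.shape[0]
    log_prior = (states * np.log(pi) + (1 - states) * np.log(1 - pi)).sum(axis=1)
    resid = x - states @ W.T
    log_lik = -0.5 * (resid ** 2).sum(axis=1) / sigma2 - 0.5 * D * np.log(2 * np.pi * sigma2)
    return log_prior + log_lik

def truncated_bound(states, x, W, pi, sigma2):
    """Truncated variational bound for one data point: log of the summed joints."""
    lj = log_joint(states, x, W, pi, sigma2)
    m = lj.max()
    return m + np.log(np.exp(lj - m).sum())

def evolve(states, x, W, pi, sigma2, rng, n_children=20):
    """One generation: mutate by single bit flips, keep the fittest unique states."""
    K, H = states.shape
    children = states[rng.integers(K, size=n_children)].copy()
    children[np.arange(n_children), rng.integers(H, size=n_children)] ^= 1
    pool = np.unique(np.vstack([states, children]), axis=0)
    keep = np.argsort(log_joint(pool, x, W, pi, sigma2))[-K:]
    return pool[keep]

rng = np.random.default_rng(2)
D, H, pi, sigma2 = 12, 8, 0.2, 0.1
W = rng.normal(size=(D, H))
s_true = (rng.random(H) < pi).astype(np.int64)
x = W @ s_true + np.sqrt(sigma2) * rng.normal(size=D)

states = np.unique(rng.integers(0, 2, size=(6, H)), axis=0)
bounds = [truncated_bound(states, x, W, pi, sigma2)]
for _ in range(30):
    states = evolve(states, x, W, pi, sigma2, rng)
    bounds.append(truncated_bound(states, x, W, pi, sigma2))
```

Because the old states always remain in the selection pool, the bound in `bounds` cannot decrease from one generation to the next.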

Direct Evolutionary Optimization of Variational Autoencoders With Binary Latents

Nov 27, 2020

Enrico Guiraud, Jakob Drefs, Jörg Lücke

Abstract: Discrete latent variables are considered important for real-world data, which has motivated research on Variational Autoencoders (VAEs) with discrete latents. However, standard VAE training is not possible in this case, which has motivated different strategies to manipulate discrete distributions in order to train discrete VAEs similarly to conventional ones. Here we ask if it is also possible to keep the discrete nature of the latents fully intact by applying a direct discrete optimization for the encoding model. The approach consequently diverges strongly from standard VAE training by sidestepping sampling approximations, the reparameterization trick, and amortization. Discrete optimization is realized in a variational setting using truncated posteriors in conjunction with evolutionary algorithms. For VAEs with binary latents, we (A) show how such a discrete variational method ties into gradient ascent for network weights, and (B) show how the decoder is used to select latent states for training. Conventional amortized training is more efficient and applicable to large neural networks. However, using smaller networks, we here find direct discrete optimization to be efficiently scalable to hundreds of latents. More importantly, we find the effectiveness of direct optimization to be highly competitive in "zero-shot" learning. In contrast to large supervised networks, the VAEs investigated here can, e.g., denoise a single image without previous training on clean data and/or training on large image datasets. More generally, the studied approach shows that training of VAEs is indeed possible without sampling-based approximation and reparameterization, which may be interesting for the analysis of VAE training in general. For "zero-shot" settings, a direct optimization furthermore makes VAEs competitive where they have previously been outperformed by non-generative approaches.

The Evidence Lower Bound of Variational Autoencoders Converges to a Sum of Three Entropies

Oct 28, 2020

Jörg Lücke, Dennis Forster, Zhenwen Dai

Abstract: The central objective function of a variational autoencoder (VAE) is its variational lower bound. Here we show that for standard VAEs the variational bound is at convergence equal to the sum of three entropies: the (negative) entropy of the latent distribution, the expected (negative) entropy of the observable distribution, and the average entropy of the variational distributions. Our derived analytical results are exact and apply for small as well as complex neural networks for decoder and encoder. Furthermore, they apply for finite and infinitely many data points and at any stationary point (including local and global maxima). As a consequence, we show that the variance parameters of encoder and decoder play the key role in determining the values of variational bounds at convergence. Furthermore, the obtained results can allow for closed-form analytical expressions at convergence, which may be unexpected as neither variational bounds of VAEs nor log-likelihoods of VAEs are closed-form during learning. As our main contribution, we provide the proofs for convergence of standard VAEs to sums of entropies. Furthermore, we numerically verify our analytical results and discuss some potential applications. The obtained equality to entropy sums provides novel information on those points in parameter space that variational learning converges to. As such, we believe these results can contribute significantly to our understanding of established as well as novel VAE approaches.
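
Written out, the three-entropies statement of the abstract takes the following form at stationary points. The notation is an assumption made here for illustration: $q_\Phi$ is the encoder, $p_\Theta$ the decoder, $N$ the number of data points, $H$ the number of latents, and $D$ the number of observables. The second line further assumes a standard normal prior, a diagonal Gaussian encoder with variances $\tau_h^2$, and a Gaussian decoder with a single scalar variance $\sigma^2$, and uses only the textbook entropy of a Gaussian.

```latex
\begin{align}
\mathcal{L}(\Phi,\Theta)
  &= \frac{1}{N}\sum_{n=1}^{N} \mathcal{H}\big[q_\Phi(z \mid x^{(n)})\big]
   \;-\; \mathcal{H}\big[p(z)\big]
   \;-\; \frac{1}{N}\sum_{n=1}^{N}
         \mathbb{E}_{q_\Phi(z \mid x^{(n)})}\Big\{\mathcal{H}\big[p_\Theta(x \mid z)\big]\Big\} \\
  &= \frac{1}{N}\sum_{n=1}^{N}\sum_{h=1}^{H}
         \tfrac{1}{2}\log\!\big(2\pi e\,\tau_h^2(x^{(n)};\Phi)\big)
   \;-\; \tfrac{H}{2}\log(2\pi e)
   \;-\; \tfrac{D}{2}\log\!\big(2\pi e\,\sigma^2\big)
\end{align}
```

Under these assumptions the bound at convergence depends only on the encoder variances and the decoder variance, which matches the abstract's remark that the variance parameters play the key role.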

Maximal Causes for Exponential Family Observables

Mar 04, 2020

S. Hamid Mousavi, Jakob Drefs, Florian Hirschberger, Jörg Lücke

Abstract: The data model of standard sparse coding assumes a weighted linear summation of latents to determine the mean of Gaussian observation noise. However, such a linear summation of latents is often at odds with non-Gaussian observables (e.g., means of the Bernoulli distribution have to lie in the unit interval), and also in the Gaussian case it can be difficult to justify for many types of data. Alternative superposition models (i.e., links between latents and observables) have therefore been investigated repeatedly. Here we show that using the maximum instead of a linear sum to link latents to observables allows for the derivation of very general and concise parameter update equations. Concretely, we derive a set of update equations that has the same functional form for all distributions of the exponential family (given that derivatives w.r.t. their parameters can be taken). Our results consequently allow for the development of latent variable models for commonly as well as for unusually distributed data. We numerically verify our analytical result assuming standard Gaussian, Gamma, Poisson, Bernoulli and Exponential distributions and point to some potential applications.

ProSper -- A Python Library for Probabilistic Sparse Coding with Non-Standard Priors and Superpositions

Aug 01, 2019

Georgios Exarchakis, Jörg Bornschein, Abdul-Saboor Sheikh, Zhenwen Dai, Marc Henniges, Jakob Drefs, Jörg Lücke

Abstract: ProSper is a Python library containing probabilistic algorithms to learn dictionaries. Given a set of data points, the implemented algorithms seek to learn the elementary components that have generated the data. The library widens the scope of dictionary learning approaches beyond implementations of standard approaches such as ICA, NMF or standard L1 sparse coding. The implemented algorithms are especially well-suited in cases when data consist of components that combine non-linearly and/or for data requiring flexible prior distributions. Furthermore, the implemented algorithms go beyond standard approaches by inferring prior and noise parameters of the data, and they provide rich a-posteriori approximations for inference. The library is designed to be extendable and it currently includes: Binary Sparse Coding (BSC), Ternary Sparse Coding (TSC), Discrete Sparse Coding (DSC), Maximal Causes Analysis (MCA), Maximum Magnitude Causes Analysis (MMCA), and Gaussian Sparse Coding (GSC, a recent spike-and-slab sparse coding approach). The algorithms are scalable due to a combination of variational approximations and parallelization. Implementations of all algorithms allow for parallel execution on multiple CPUs and multiple machines for medium to large-scale applications. Typical large-scale runs of the algorithms can use hundreds of CPUs to learn hundreds of dictionary elements from data with tens of millions of floating-point numbers such that models with several hundred thousand parameters can be optimized. The library is designed to have minimal dependencies and to be easy to use. It targets users of dictionary learning algorithms and machine learning researchers.

Accelerated Training of Large-Scale Gaussian Mixtures by a Merger of Sublinear Approaches

Oct 01, 2018

Florian Hirschberger, Dennis Forster, Jörg Lücke

Abstract: We combine two recent lines of research on sublinear clustering to significantly increase the efficiency of training Gaussian mixture models (GMMs) on large-scale problems. First, we use a novel truncated variational EM approach for GMMs with isotropic Gaussians in order to increase clustering efficiency for large $C$ (many clusters). Second, we use recent coreset approaches to increase clustering efficiency for large $N$ (many data points). In order to derive a novel accelerated algorithm, we first show analytically how variational EM and coreset objectives can be merged to give rise to a new, combined clustering objective. Each iteration of the novel algorithm derived from this merged objective is then shown to have a run-time cost of $\mathcal{O}(N' G^2 D)$, where $N'<N$ is the coreset size and $G^2<C$ is a constant related to the extent of local cluster neighborhoods. While enabling clustering with a strongly reduced number of distance evaluations per iteration, the combined approach is observed to still very effectively increase the clustering objective. In a series of numerical experiments on standard benchmarks, we use efficient seeding for initialization and evaluate the net computational demand of the merged approach in comparison to (already highly efficient) recent approaches. As a result, depending on the dataset and number of clusters, the merged algorithm shows several times (and up to an order of magnitude) faster execution times to reach the same quantization errors as algorithms based on coresets or on variational EM alone.

Can clustering scale sublinearly with its clusters? A variational EM acceleration of GMMs and $k$-means

Apr 17, 2018

Dennis Forster, Jörg Lücke

Abstract: One iteration of standard $k$-means (i.e., Lloyd's algorithm) or standard EM for Gaussian mixture models (GMMs) scales linearly with the number of clusters $C$, data points $N$, and data dimensionality $D$. In this study, we explore whether one iteration of $k$-means or EM for GMMs can scale sublinearly with $C$ at run-time, while remaining effective at improving the clustering objective. The tool we apply for complexity reduction is variational EM, which is typically used to make training of generative models with exponentially many hidden states tractable. Here, we apply novel theoretical results on truncated variational EM to make tractable clustering algorithms more efficient. The basic idea is to use a partial variational E-step which reduces the linear complexity of $\mathcal{O}(NCD)$ required for a full E-step to a sublinear complexity. Our main observation is that the linear dependency on $C$ can be reduced to a dependency on a much smaller parameter $G$ which relates to cluster neighborhood relations. We focus on two versions of partial variational EM for clustering: variational GMM, scaling with $\mathcal{O}(NG^2D)$, and variational $k$-means, scaling with $\mathcal{O}(NGD)$ per iteration. Empirical results show that these algorithms still require comparable numbers of iterations to improve the clustering objective to the same values as $k$-means. For data with many clusters, we consequently observe reductions of net computational demands between two and three orders of magnitude. More generally, our results provide substantial empirical evidence that clustering can scale sublinearly with $C$.
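
The partial E-step can be illustrated with a simplified $k$-means variant in which each data point compares itself only to the $G$ centers nearest to its current center, giving $NG$ instead of $NC$ distance evaluations. This is a sketch under stated assumptions rather than the paper's algorithm: for brevity the neighbor lists are built from all center-to-center distances, which costs $\mathcal{O}(C^2D)$ and is not part of the complexities quoted in the abstract. Function names and data are hypothetical.

```python
import numpy as np

def partial_kmeans_iteration(X, centers, assign, G):
    """One k-means iteration with a partial E-step over G candidate centers."""
    N, C = X.shape[0], centers.shape[0]
    # Neighbor lists: the G centers closest to each center (itself included).
    cc = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    neighbors = np.argsort(cc, axis=1)[:, :G]                # (C, G)
    # Partial E-step: only N * G point-to-center distances are evaluated.
    candidates = neighbors[assign]                           # (N, G)
    dist = ((X[:, None, :] - centers[candidates]) ** 2).sum(axis=-1)
    new_assign = candidates[np.arange(N), dist.argmin(axis=1)]
    # M-step: recompute means; empty clusters keep their old center.
    new_centers = centers.copy()
    for c in range(C):
        members = X[new_assign == c]
        if len(members) > 0:
            new_centers[c] = members.mean(axis=0)
    return new_centers, new_assign

def quantization_error(X, centers, assign):
    return ((X - centers[assign]) ** 2).sum()

rng = np.random.default_rng(3)
N, C, D, G = 600, 20, 2, 4
X = rng.normal(size=(N, D)) + rng.integers(0, 5, size=(N, 1)) * 3.0
centers = X[rng.choice(N, size=C, replace=False)].copy()
# One full E-step for initialization only.
assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)

errors = [quantization_error(X, centers, assign)]
for _ in range(15):
    centers, assign = partial_kmeans_iteration(X, centers, assign, G)
    errors.append(quantization_error(X, centers, assign))
```

Since each point's current center is always among its candidates, the quantization error in `errors` cannot increase from one iteration to the next, even though only a fraction $G/C$ of the distances is computed.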
