Zhenisbek Assylbekov

Learning Overspecified Gaussian Mixtures Exponentially Fast with the EM Algorithm

Jun 13, 2025

Zhenisbek Assylbekov, Alan Legg, Artur Pak

Abstract: We investigate the convergence properties of the EM algorithm when applied to overspecified Gaussian mixture models, that is, when the number of components in the fitted model exceeds that of the true underlying distribution. Focusing on a structured configuration where the component means are positioned at the vertices of a regular simplex and the mixture weights satisfy a non-degeneracy condition, we demonstrate that the population EM algorithm converges exponentially fast in terms of the Kullback-Leibler (KL) distance. Our analysis leverages the strong convexity of the negative log-likelihood function in a neighborhood around the optimum and utilizes the Polyak-Łojasiewicz inequality to establish that an $\epsilon$-accurate approximation is achievable in $O(\log(1/\epsilon))$ iterations. Furthermore, we extend these results to a finite-sample setting by deriving explicit statistical convergence guarantees. Numerical experiments on synthetic datasets corroborate our theoretical findings, highlighting the dramatic acceleration in convergence compared to conventional sublinear rates. This work not only deepens the understanding of EM's behavior in overspecified settings but also offers practical insights into initialization strategies and model design for high-dimensional clustering and density estimation tasks.

* ECML PKDD 2025
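
The contrast between exponential and sublinear convergence described in the abstract above can be illustrated with a one-dimensional toy problem. The sketch below is not the paper's experiment: it assumes the data come from N(0, 1), fits the two-component model pi*N(theta, 1) + (1 - pi)*N(-theta, 1) (means at the two vertices of a 1-D simplex), and approximates the population expectation by a Riemann sum. The function names, grid settings, and the choice pi = 0.7 are all illustrative assumptions. For this model the EM update reduces to theta_new = E[X tanh(theta*X + beta)] with beta = 0.5*log(pi/(1 - pi)).

```python
import math

def population_em_step(theta, pi, half_width=8.0, n_grid=1601):
    """One population EM update for pi*N(theta,1) + (1-pi)*N(-theta,1)
    when the true distribution is N(0,1).

    The expectation E[X * tanh(theta*X + beta)] is approximated by a
    Riemann sum on a symmetric grid (an assumption of this sketch).
    """
    beta = 0.5 * math.log(pi / (1.0 - pi))
    step = 2.0 * half_width / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        x = -half_width + i * step
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += x * math.tanh(theta * x + beta) * density * step
    return total

def run_em(theta0, pi, n_iters):
    """Iterate the population EM map and return the whole trajectory."""
    thetas = [theta0]
    for _ in range(n_iters):
        thetas.append(population_em_step(thetas[-1], pi))
    return thetas

# Unequal weights (hypothetical choice 0.7 / 0.3) versus equal weights.
unbalanced = run_em(theta0=1.0, pi=0.7, n_iters=50)
balanced = run_em(theta0=1.0, pi=0.5, n_iters=50)

for t in (0, 10, 25, 50):
    print(t, unbalanced[t], balanced[t])
```

In this toy setting the true parameter is theta = 0. With unequal weights the iterates shrink by a roughly constant factor per step, while with equal weights the decay is much slower, which is the qualitative gap the abstract refers to.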

Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic

Oct 19, 2023

Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhenisbek Assylbekov

Abstract: Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by Statistical Query algorithms. Recently this classical fact re-emerged in the theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. A set of functions of the form $x\to ax \bmod p$, where $a$ is taken from $\mathbb{Z}_p$, has recently attracted some attention from deep learning theorists and cryptographers. This class can be understood as a subset of $p$-periodic functions on $\mathbb{Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of the limitations and challenges associated with using gradient-based learning techniques to train a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either the frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful.
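
The notion of "approximately orthogonal" target functions in the abstract above can be probed numerically for the class $x \to ax \bmod p$. The sketch below is only a toy check, not the paper's analysis: it measures the average squared Pearson correlation between $f_1$ and $f_a$ over all $a \ne 1$, with $x$ uniform on $\{0, \dots, p-1\}$. The helper names and the use of Pearson correlation as the orthogonality measure are assumptions made for illustration.

```python
import math

def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def mean_squared_correlation(p):
    """Average of corr(f_1, f_a)^2 over a = 2..p-1, where f_a(x) = a*x mod p.

    For prime p, substituting x -> a^{-1} x shows that
    corr(f_a, f_b) = corr(f_1, f_{b/a}), so comparing against f_1 covers
    every pair of distinct functions in the class.
    """
    xs = list(range(p))
    total = 0.0
    for a in range(2, p):
        f_a = [(a * x) % p for x in xs]
        total += correlation(xs, f_a) ** 2
    return total / (p - 2)

for p in (11, 101, 1009):
    print(p, mean_squared_correlation(p))
```

Individual pairs can still be noticeably correlated (for example $a = 2$), so this check concerns the average over the class, which is the quantity relevant when the target is chosen at random.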

Intractability of Learning the Discrete Logarithm with Gradient-Based Methods

Oct 02, 2023

Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhibek Kadyrsizova, Zhenisbek Assylbekov

Abstract: The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentration of the gradient of the loss function around a fixed point, independent of the base of the logarithm. This concentration property leads to a restricted ability to learn the parity bit efficiently using gradient-based methods, irrespective of the complexity of the network architecture being trained. Our proof relies on the Boas-Bellman inequality in inner product spaces and involves establishing approximate orthogonality of the discrete logarithm's parity bit functions through the spectral norm of certain matrices. Empirical experiments using a neural network-based approach further verify the limitations of gradient-based learning, demonstrating a decreasing success rate in predicting the parity bit as the group order increases.

* ACML 2023
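
To make the objects in the abstract above concrete, the following sketch builds the parity-bit functions of the discrete logarithm for every base in a small cyclic group of prime order and measures how much they overlap. It is a brute-force illustration under stated assumptions, not the paper's construction: the group is taken to be the order-11 subgroup of the integers modulo 23 (generated by 4), the parity bit is encoded as +1/-1, and all function names are hypothetical.

```python
def dlog_table(g, p, q):
    """Map each element h of the order-q subgroup generated by g (mod p)
    to the exponent k in 0..q-1 with g**k % p == h."""
    table = {}
    h = 1
    for k in range(q):
        table[h] = k
        h = (h * g) % p
    assert h == 1 and len(table) == q, "g must have order q modulo p"
    return table

def parity_signs(g, p, q, elements):
    """Parity bit of log_g(h) for each h, encoded as +1 (even) or -1 (odd)."""
    table = dlog_table(g, p, q)
    return [1 - 2 * (table[h] % 2) for h in elements]

def normalized_inner_product(u, v):
    """Inner product divided by the vector length, so values lie in [-1, 1]."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

# Assumed toy group: p = 2q + 1 with q = 11 prime; 4 = 2**2 generates the
# subgroup of order q, and every non-identity element is a valid base.
p, q = 23, 11
elements = sorted(dlog_table(4, p, q))
bases = [h for h in elements if h != 1]
signs = {g: parity_signs(g, p, q, elements) for g in bases}

max_overlap = max(
    abs(normalized_inner_product(signs[a], signs[b]))
    for a in bases for b in bases if a < b
)
print("largest |overlap| between parity functions of two bases:", max_overlap)
```

A group this small only shows the mechanics; any statement about how the overlaps behave as the group order grows would need larger groups and is left to the paper's analysis.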

Long-Tail Theory under Gaussian Mixtures

Jul 24, 2023

Arman Bolatov, Maxat Tezekbayev, Igor Melnykov, Artur Pak, Vassilina Nikoulina, Zhenisbek Assylbekov

Abstract: We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered for optimal generalization to new data. Finally, we show that the performance gap between linear and nonlinear models can be lessened as the tail becomes shorter in the subpopulation frequency distribution, as confirmed by experiments on synthetic and real data.

* accepted to ECAI 2023

From Hyperbolic Geometry Back to Word Embeddings

Apr 26, 2022

Sultan Nurmukhamedov, Thomas Mach, Arsen Sheverdin, Zhenisbek Assylbekov

Abstract: We choose random points in the hyperbolic disc and claim that these points are already word representations. However, it is yet to be uncovered which point corresponds to which word of the human language of interest. This correspondence can be approximately established using pointwise mutual information between words and recent alignment techniques.

Speeding Up Entmax

Nov 15, 2021

Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov

Abstract: Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, because it produces a dense probability distribution, each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $\alpha$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this problem, but is considerably slower than softmax. In this paper, we propose an alternative to $\alpha$-entmax, which keeps its virtuous characteristics, but is as fast as optimized softmax and achieves on par or better performance on a machine translation task.

* 8 pages, 6 figures
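
The dense-versus-sparse distinction in the abstract above can be seen on a three-element logit vector. The sketch below compares softmax with sparsemax, which is the $\alpha = 2$ member of the $\alpha$-entmax family. It is not the alternative proposed in the paper (the abstract does not specify it); it only illustrates why a sparse normalizer can assign exactly zero probability to some tokens. The sort-based sparsemax routine and the example logits are illustrative choices.

```python
import math

def softmax(z):
    """Numerically stable softmax; every output is strictly positive."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex
    (alpha-entmax with alpha = 2), via the standard sort-based threshold."""
    zs = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, v in enumerate(zs, start=1):
        cumsum += v
        # The support condition holds for k = 1..k* and fails afterwards,
        # so the last update of tau uses the largest valid support size k*.
        if 1.0 + k * v > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(v - tau, 0.0) for v in z]

logits = [1.0, 0.5, -1.0]
print("softmax:  ", softmax(logits))
print("sparsemax:", sparsemax(logits))
```

The sort makes this sparsemax routine O(n log n) in the vocabulary size, which hints at why sparse normalizers tend to cost more than softmax.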

The Rediscovery Hypothesis: Language Models Need to Meet Linguistics

Mar 02, 2021

Vassilina Nikoulina, Maxat Tezekbayev, Nuradil Kozhakhmet, Madina Babazhanova, Matthias Gallé, Zhenisbek Assylbekov

Abstract: There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper we study whether linguistic knowledge is a necessary condition for good performance of modern language models, which we call the rediscovery hypothesis. First, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates the language modeling objective to linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real tasks.

Binarized PMI Matrix: Bridging Word Embeddings and Hyperbolic Spaces

Feb 27, 2020

Zhenisbek Assylbekov, Alibi Jangeldin

Abstract: We show analytically that removing the sigmoid transformation in the SGNS objective does not harm the quality of word vectors significantly and at the same time is related to factorizing a binarized PMI matrix which, in turn, can be treated as an adjacency matrix of a certain graph. Empirically, such a graph is a complex network, i.e. it has strong clustering and a scale-free degree distribution, and is tightly connected with hyperbolic spaces. In short, we show the connection between static word embeddings and hyperbolic spaces through the binarized PMI matrix using analytical and empirical methods.
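
The step from co-occurrence counts to a graph adjacency matrix mentioned in the abstract above can be sketched in a few lines. The code assumes that "binarized" means thresholding PMI at zero (entry 1 when PMI > 0, else 0), which the abstract does not state, and it uses a made-up 3x3 count matrix. The function name and the data are hypothetical.

```python
import math

def binarized_pmi(counts):
    """Return a 0/1 matrix with entry 1 where PMI(i, j) > 0.

    PMI(i, j) = log( count(i, j) * total / (row_sum(i) * col_sum(j)) ).
    Pairs that never co-occur have PMI = -infinity and map to 0.
    The zero threshold is an assumption of this sketch.
    """
    n = len(counts)
    row = [sum(r) for r in counts]
    col = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    total = sum(row)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c = counts[i][j]
            if c > 0:
                pmi = math.log(c * total / (row[i] * col[j]))
                out[i][j] = 1 if pmi > 0 else 0
    return out

# Made-up symmetric word-context co-occurrence counts for three words.
counts = [
    [2, 6, 1],
    [6, 2, 1],
    [1, 1, 4],
]
adjacency = binarized_pmi(counts)
degrees = [sum(r) for r in adjacency]
print(adjacency)
print("node degrees:", degrees)
```

Reading the result as an adjacency matrix, the row sums are node degrees, which is the quantity whose distribution the abstract describes as scale-free on real corpora.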

Semantics- and Syntax-related Subvectors in the Skip-gram Embeddings

Dec 23, 2019

Maxat Tezekbayev, Zhenisbek Assylbekov, Rustem Takhanov

Abstract: We show that the skip-gram embedding of any word can be decomposed into two subvectors which roughly correspond to semantic and syntactic roles of the word.

* 2 pages, 1 figure, Student Abstract

A Critique of the Smooth Inverse Frequency Sentence Embeddings

Sep 30, 2019

Aidana Karipbayeva, Alena Sorokina, Zhenisbek Assylbekov

Abstract: We critically review the smooth inverse frequency sentence embedding method of Arora, Liang, and Ma (2017), and show inconsistencies in its setup, derivation, and evaluation.

* 2 pages, 2 figures, Abstract
