Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jia-Jie Zhu

Gradient Flow Sampler-based Distributionally Robust Optimization

Oct 29, 2025

Zusen Xu, Jia-Jie Zhu

Abstract:We propose a mathematically principled PDE gradient flow framework for distributionally robust optimization (DRO). Exploiting the recent advances in the intersection of Markov Chain Monte Carlo sampling and gradient flow theory, we show that our theoretical framework can be implemented as practical algorithms for sampling from worst-case distributions and, consequently, DRO. While numerous previous works have proposed various reformulation techniques and iterative algorithms, we contribute a sound gradient flow view of the distributional optimization that can be used to construct new algorithms. As an example of applications, we solve a class of Wasserstein and Sinkhorn DRO problems using the recently-discovered Wasserstein Fisher-Rao and Stein variational gradient flows. Notably, we also show some simple reductions of our framework recover exactly previously proposed popular DRO methods, and provide new insights into their theoretical limit and optimization dynamics. Numerical studies based on stochastic gradient descent provide empirical backing for our theoretical findings.

Via

Access Paper or Ask Questions

Evolution of Gaussians in the Hellinger-Kantorovich-Boltzmann gradient flow

Apr 29, 2025

Matthias Liero, Alexander Mielke, Oliver Tse, Jia-Jie Zhu

Abstract:This study leverages the basic insight that the gradient-flow equation associated with the relative Boltzmann entropy, in relation to a Gaussian reference measure within the Hellinger-Kantorovich (HK) geometry, preserves the class of Gaussian measures. This invariance serves as the foundation for constructing a reduced gradient structure on the parameter space characterizing Gaussian densities. We derive explicit ordinary differential equations that govern the evolution of mean, covariance, and mass under the HK-Boltzmann gradient flow. The reduced structure retains the additive form of the HK metric, facilitating a comprehensive analysis of the dynamics involved. We explore the geodesic convexity of the reduced system, revealing that global convexity is confined to the pure transport scenario, while a variant of sublevel semi-convexity is observed in the general case. Furthermore, we demonstrate exponential convergence to equilibrium through Polyak-Lojasiewicz-type inequalities, applicable both globally and on sublevel sets. By monitoring the evolution of covariance eigenvalues, we refine the decay rates associated with convergence. Additionally, we extend our analysis to non-Gaussian targets exhibiting strong log-lambda-concavity, corroborating our theoretical results with numerical experiments that encompass a Gaussian-target gradient flow and a Bayesian logistic regression application.

Via

Access Paper or Ask Questions

Hellinger-Kantorovich Gradient Flows: Global Exponential Decay of Entropy Functionals

Jan 28, 2025

Alexander Mielke, Jia-Jie Zhu

Abstract:We investigate a family of gradient flows of positive and probability measures, focusing on the Hellinger-Kantorovich (HK) geometry, which unifies transport mechanism of Otto-Wasserstein, and the birth-death mechanism of Hellinger (or Fisher-Rao). A central contribution is a complete characterization of global exponential decay behaviors of entropy functionals (e.g. KL, $\chi^2$) under Otto-Wasserstein and Hellinger-type gradient flows. In particular, for the more challenging analysis of HK gradient flows on positive measures -- where the typical log-Sobolev arguments fail -- we develop a specialized shape-mass decomposition that enables new analysis results. Our approach also leverages the (Polyak-)\L{}ojasiewicz-type functional inequalities and a careful extension of classical dissipation estimates. These findings provide a unified and complete theoretical framework for gradient flows and underpin applications in computational algorithms for statistical inference, optimization, and machine learning.

Via

Access Paper or Ask Questions

Inclusive KL Minimization: A Wasserstein-Fisher-Rao Gradient Flow Perspective

Oct 31, 2024

Jia-Jie Zhu

Abstract:Otto's (2001) Wasserstein gradient flow of the exclusive KL divergence functional provides a powerful and mathematically principled perspective for analyzing learning and inference algorithms. In contrast, algorithms for the inclusive KL inference, i.e., minimizing $ \mathrm{KL}(\pi \| \mu) $ with respect to $ \mu $ for some target $ \pi $, are rarely analyzed using tools from mathematical analysis. This paper shows that a general-purpose approximate inclusive KL inference paradigm can be constructed using the theory of gradient flows derived from PDE analysis. We uncover that several existing learning algorithms can be viewed as particular realizations of the inclusive KL inference paradigm. For example, existing sampling algorithms such as Arbel et al. (2019) and Korba et al. (2021) can be viewed in a unified manner as inclusive-KL inference with approximate gradient estimators. Finally, we provide the theoretical foundation for the Wasserstein-Fisher-Rao gradient flows for minimizing the inclusive KL divergence.

Via

Access Paper or Ask Questions

Kernel Approximation of Fisher-Rao Gradient Flows

Oct 27, 2024

Jia-Jie Zhu, Alexander Mielke

Abstract:The purpose of this paper is to answer a few open questions in the interface of kernel methods and PDE gradient flows. Motivated by recent advances in machine learning, particularly in generative modeling and sampling, we present a rigorous investigation of Fisher-Rao and Wasserstein type gradient flows concerning their gradient structures, flow equations, and their kernel approximations. Specifically, we focus on the Fisher-Rao (also known as Hellinger) geometry and its various kernel-based approximations, developing a principled theoretical framework using tools from PDE gradient flows and optimal transport theory. We also provide a complete characterization of gradient flows in the maximum-mean discrepancy (MMD) space, with connections to existing learning and inference algorithms. Our analysis reveals precise theoretical insights linking Fisher-Rao flows, Stein flows, kernel discrepancies, and nonparametric regression. We then rigorously prove evolutionary $\Gamma$-convergence for kernel-approximated Fisher-Rao flows, providing theoretical guarantees beyond pointwise convergence. Finally, we analyze energy dissipation using the Helmholtz-Rayleigh principle, establishing important connections between classical theory in mechanics and modern machine learning practice. Our results provide a unified theoretical foundation for understanding and analyzing approximations of gradient flows in machine learning applications through a rigorous gradient flow and variational method perspective.

Via

Access Paper or Ask Questions

Interaction-Force Transport Gradient Flows

May 27, 2024

Egor Gladin, Pavel Dvurechensky, Alexander Mielke, Jia-Jie Zhu

Abstract:This paper presents a new type of gradient flow geometries over non-negative and probability measures motivated via a principled construction that combines the optimal transport and interaction forces modeled by reproducing kernels. Concretely, we propose the interaction-force transport (IFT) gradient flows and its spherical variant via an infimal convolution of the Wasserstein and spherical MMD Riemannian metric tensors. We then develop a particle-based optimization algorithm based on the JKO-splitting scheme of the mass-preserving spherical IFT gradient flows. Finally, we provide both theoretical global exponential convergence guarantees and empirical simulation results for applying the IFT gradient flows to the sampling task of MMD-minimization studied by Arbel et al. [2019]. Furthermore, we prove that the spherical IFT gradient flow enjoys the best of both worlds by providing the global exponential convergence guarantee for both the MMD and KL energy.

Via

Access Paper or Ask Questions

Analysis of Kernel Mirror Prox for Measure Optimization

Feb 29, 2024

Pavel Dvurechensky, Jia-Jie Zhu

Abstract:By choosing a suitable function space as the dual to the non-negative measure cone, we study in a unified framework a class of functional saddle-point optimization problems, which we term the Mixed Functional Nash Equilibrium (MFNE), that underlies several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. We model the saddle-point optimization dynamics as an interacting Fisher-Rao-RKHS gradient flow when the function space is chosen as a reproducing kernel Hilbert space (RKHS). As a discrete time counterpart, we propose a primal-dual kernel mirror prox (KMP) algorithm, which uses a dual step in the RKHS, and a primal entropic mirror prox step. We then provide a unified convergence analysis of KMP in an infinite-dimensional setting for this class of MFNE problems, which establishes a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case, where $N$ is the iteration counter. As a case study, we apply our analysis to DRO, providing algorithmic guarantees for DRO robustness and convergence.

* Accepted to AISTATS 2024

Via

Access Paper or Ask Questions

An Inexact Halpern Iteration with Application to Distributionally Robust Optimization

Feb 12, 2024

Ling Liang, Kim-Chuan Toh, Jia-Jie Zhu

Abstract:The Halpern iteration for solving monotone inclusion problems has gained increasing interests in recent years due to its simple form and appealing convergence properties. In this paper, we investigate the inexact variants of the scheme in both deterministic and stochastic settings. We conduct extensive convergence analysis and show that by choosing the inexactness tolerances appropriately, the inexact schemes admit an $O(k^{-1})$ convergence rate in terms of the (expected) residue norm. Our results relax the state-of-the-art inexactness conditions employed in the literature while sharing the same competitive convergence properties. We then demonstrate how the proposed methods can be applied for solving two classes of data-driven Wasserstein distributionally robust optimization problems that admit convex-concave min-max optimization reformulations. We highlight its capability of performing inexact computations for distributionally robust learning with stochastic first-order methods.

* Correct a typo in the title and update authors' information

Via

Access Paper or Ask Questions

Estimation Beyond Data Reweighting: Kernel Method of Moments

May 18, 2023

Heiner Kremer, Yassine Nemmour, Bernhard Schölkopf, Jia-Jie Zhu

Abstract:Moment restrictions and their conditional counterparts emerge in many areas of machine learning and statistics ranging from causal inference to reinforcement learning. Estimators for these tasks, generally called methods of moments, include the prominent generalized method of moments (GMM) which has recently gained attention in causal inference. GMM is a special case of the broader family of empirical likelihood estimators which are based on approximating a population distribution by means of minimizing a $\varphi$-divergence to an empirical distribution. However, the use of $\varphi$-divergences effectively limits the candidate distributions to reweightings of the data samples. We lift this long-standing limitation and provide a method of moments that goes beyond data reweighting. This is achieved by defining an empirical likelihood estimator based on maximum mean discrepancy which we term the kernel method of moments (KMM). We provide a variant of our estimator for conditional moment restrictions and show that it is asymptotically first-order optimal for such problems. Finally, we show that our method achieves competitive performance on several conditional moment restriction tasks.

Via

Access Paper or Ask Questions

Propagating Kernel Ambiguity Sets in Nonlinear Data-driven Dynamics Models

Apr 27, 2023

Jia-Jie Zhu

Abstract:This paper provides answers to an open problem: given a nonlinear data-driven dynamical system model, e.g., kernel conditional mean embedding (CME) and Koopman operator, how can one propagate the ambiguity sets forward for multiple steps? This problem is the key to solving distributionally robust control and learning-based control of such learned system models under a data-distribution shift. Different from previous works that use either static ambiguity sets, e.g., fixed Wasserstein balls, or dynamic ambiguity sets under known piece-wise linear (or affine) dynamics, we propose an algorithm that exactly propagates ambiguity sets through nonlinear data-driven models using the Koopman operator and CME, via the kernel maximum mean discrepancy geometry. Through both theoretical and numerical analysis, we show that our kernel ambiguity sets are the natural geometric structure for the learned data-driven dynamical system models.

Via

Access Paper or Ask Questions