Abstract:We consider a minimax problem motivated by distributionally robust optimization (DRO) when the worst-case distribution is continuous, leading to significant computational challenges due to the infinite-dimensional nature of the optimization problem. Recent research has explored learning the worst-case distribution using neural network-based generative models to address these computational challenges but lacks algorithmic convergence guarantees. This paper bridges this theoretical gap by presenting an iterative algorithm to solve such a minimax problem, achieving global convergence under mild assumptions and leveraging technical tools from vector space minimax optimization and convex analysis in the space of continuous probability densities. In particular, leveraging Brenier's theorem, we represent the worst-case distribution as a transport map applied to a continuous reference measure and reformulate the regularized discrepancy-based DRO as a minimax problem in the Wasserstein space. Furthermore, we demonstrate that the worst-case distribution can be efficiently computed using a modified Jordan-Kinderlehrer-Otto (JKO) scheme with sufficiently large regularization parameters for commonly used discrepancy functions, linked to the radius of the ambiguity set. Additionally, we derive the global convergence rate and quantify the total number of subgradient and inexact modified JKO iterations required to obtain approximate stationary points. These results are potentially applicable to nonconvex and nonsmooth scenarios, with broad relevance to modern machine learning applications.
Abstract:Reliable uncertainty quantification at unobserved spatial locations, especially in the presence of complex and heterogeneous datasets, remains a core challenge in spatial statistics. Traditional approaches like Kriging rely heavily on assumptions such as normality, which often break down in large-scale, diverse datasets, leading to unreliable prediction intervals. While machine learning methods have emerged as powerful alternatives, they primarily focus on point predictions and provide limited mechanisms for uncertainty quantification. Conformal prediction, a distribution-free framework, offers valid prediction intervals without relying on parametric assumptions. However, existing conformal prediction methods are either not tailored for spatial settings, or existing ones for spatial data have relied on rather restrictive i.i.d. assumptions. In this paper, we propose Localized Spatial Conformal Prediction (LSCP), a conformal prediction method designed specifically for spatial data. LSCP leverages localized quantile regression to construct prediction intervals. Instead of i.i.d. assumptions, our theoretical analysis builds on weaker conditions of stationarity and spatial mixing, which is natural for spatial data, providing finite-sample bounds on the conditional coverage gap and establishing asymptotic guarantees for conditional coverage. We present experiments on both synthetic and real-world datasets to demonstrate that LSCP achieves accurate coverage with significantly tighter and more consistent prediction intervals across the spatial domain compared to existing methods.
Abstract:In recent years, increasingly unpredictable and severe global weather patterns have frequently caused long-lasting power outages. Building resilience, the ability to withstand, adapt to, and recover from major disruptions, has become crucial for the power industry. To enable rapid recovery, accurately predicting future outage numbers is essential. Rather than relying on simple point estimates, we analyze extensive quarter-hourly outage data and develop a graph conformal prediction method that delivers accurate prediction regions for outage numbers across the states for a time period. We demonstrate the effectiveness of this method through extensive numerical experiments in several states affected by extreme weather events that led to widespread outages.
Abstract:Point processes are widely used statistical models for uncovering the temporal patterns in dependent event data. In many applications, the event time cannot be observed exactly, calling for the incorporation of time uncertainty into the modeling of point process data. In this work, we introduce a framework to model time-uncertain point processes possibly on a network. We start by deriving the formulation in the continuous-time setting under a few assumptions motivated by application scenarios. After imposing a time grid, we obtain a discrete-time model that facilitates inference and can be computed by first-order optimization methods such as Gradient Descent or Variation inequality (VI) using batch-based Stochastic Gradient Descent (SGD). The parameter recovery guarantee is proved for VI inference at an $O(1/k)$ convergence rate using $k$ SGD steps. Our framework handles non-stationary processes by modeling the inference kernel as a matrix (or tensor on a network) and it covers the stationary process, such as the classical Hawkes process, as a special case. We experimentally show that the proposed approach outperforms previous General Linear model (GLM) baselines on simulated and real data and reveals meaningful causal relations on a Sepsis-associated Derangements dataset.
Abstract:Flow Matching (FM) is a simulation-free method for learning a continuous and invertible flow to interpolate between two distributions, and in particular to generate data from noise in generative modeling. In this paper, we introduce Local Flow Matching (LFM), which learns a sequence of FM sub-models and each matches a diffusion process up to the time of the step size in the data-to-noise direction. In each step, the two distributions to be interpolated by the sub-model are closer to each other than data vs. noise, and this enables the use of smaller models with faster training. The stepwise structure of LFM is natural to be distilled and different distillation techniques can be adopted to speed up generation. Theoretically, we prove a generation guarantee of the proposed flow model in terms of the $\chi^2$-divergence between the generated and true data distributions. In experiments, we demonstrate the improved training efficiency and competitive generative performance of LFM compared to FM on the unconditional generation of tabular data and image datasets, and also on the conditional generation of robotic manipulation policies.
Abstract:Posterior sampling in high-dimensional spaces using generative models holds significant promise for various applications, including but not limited to inverse problems and guided generation tasks. Despite many recent developments, generating diverse posterior samples remains a challenge, as existing methods require restarting the entire generative process for each new sample, making the procedure computationally expensive. In this work, we propose efficient posterior sampling by simulating Langevin dynamics in the noise space of a pre-trained generative model. By exploiting the mapping between the noise and data spaces which can be provided by distilled flows or consistency models, our method enables seamless exploration of the posterior without the need to re-run the full sampling chain, drastically reducing computational overhead. Theoretically, we prove a guarantee for the proposed noise-space Langevin dynamics to approximate the posterior, assuming that the generative model sufficiently approximates the prior distribution. Our framework is experimentally validated on image restoration tasks involving noisy linear and nonlinear forward operators applied to LSUN-Bedroom (256 x 256) and ImageNet (64 x 64) datasets. The results demonstrate that our approach generates high-fidelity samples with enhanced semantic diversity even under a limited number of function evaluations, offering superior efficiency and performance compared to existing diffusion-based posterior sampling techniques.
Abstract:Sampling from high-dimensional, multi-modal distributions remains a fundamental challenge across domains such as statistical Bayesian inference and physics-based machine learning. In this paper, we propose Annealing Flow (AF), a continuous normalizing flow-based approach designed to sample from high-dimensional and multi-modal distributions. The key idea is to learn a continuous normalizing flow-based transport map, guided by annealing, to transition samples from an easy-to-sample distribution to the target distribution, facilitating effective exploration of modes in high-dimensional spaces. Unlike many existing methods, AF training does not rely on samples from the target distribution. AF ensures effective and balanced mode exploration, achieves linear complexity in sample size and dimensions, and circumvents inefficient mixing times. We demonstrate the superior performance of AF compared to state-of-the-art methods through extensive experiments on various challenging distributions and real-world datasets, particularly in high-dimensional and multi-modal settings. We also highlight the potential of AF for sampling the least favorable distributions.
Abstract:Despite the growing prevalence of artificial neural networks in real-world applications, their vulnerability to adversarial attacks remains a significant concern, which motivates us to investigate the robustness of machine learning models. While various heuristics aim to optimize the distributionally robust risk using the $\infty$-Wasserstein metric, such a notion of robustness frequently encounters computation intractability. To tackle the computational challenge, we develop a novel approach to adversarial training that integrates $\phi$-divergence regularization into the distributionally robust risk function. This regularization brings a notable improvement in computation compared with the original formulation. We develop stochastic gradient methods with biased oracles to solve this problem efficiently, achieving the near-optimal sample complexity. Moreover, we establish its regularization effects and demonstrate it is asymptotic equivalence to a regularized empirical risk minimization framework, by considering various scaling regimes of the regularization parameter and robustness level. These regimes yield gradient norm regularization, variance regularization, or a smoothed gradient norm regularization that interpolates between these extremes. We numerically validate our proposed method in supervised learning, reinforcement learning, and contextual learning and showcase its state-of-the-art performance against various adversarial attacks.
Abstract:In the wild, we often encounter collections of sequential data such as electrocardiograms, motion capture, genomes, and natural language, and sequences may be multichannel or symbolic with nonlinear dynamics. We introduce a new method to learn low-dimensional representations of nonlinear time series without supervision and can have provable recovery guarantees. The learned representation can be used for downstream machine-learning tasks such as clustering and classification. The method is based on the assumption that the observed sequences arise from a common domain, but each sequence obeys its own autoregressive models that are related to each other through low-rank regularization. We cast the problem as a computationally efficient convex matrix parameter recovery problem using monotone Variational Inequality and encode the common domain assumption via low-rank constraint across the learned representations, which can learn the geometry for the entire domain as well as faithful representations for the dynamics of each individual sequence using the domain information in totality. We show the competitive performance of our method on real-world time-series data with the baselines and demonstrate its effectiveness for symbolic text modeling and RNA sequence clustering.
Abstract:We present a conformal prediction method for time series using the Transformer architecture to capture long-memory and long-range dependencies. Specifically, we use the Transformer decoder as a conditional quantile estimator to predict the quantiles of prediction residuals, which are used to estimate the prediction interval. We hypothesize that the Transformer decoder benefits the estimation of the prediction interval by learning temporal dependencies across past prediction residuals. Our comprehensive experiments using simulated and real data empirically demonstrate the superiority of the proposed method compared to the existing state-of-the-art conformal prediction methods.