Abstract:Diffusion-based generative models provide a powerful framework for learning to sample from a complex target distribution. The remarkable empirical success of these models applied to high-dimensional signals, including images and video, stands in stark contrast to classical results highlighting the curse of dimensionality for distribution recovery. In this work, we take a step towards understanding this gap through a careful analysis of learning diffusion models over the Barron space of single layer neural networks. In particular, we show that these shallow models provably adapt to simple forms of low dimensional structure, thereby avoiding the curse of dimensionality. We combine our results with recent analyses of sampling with diffusion models to provide an end-to-end sample complexity bound for learning to sample from structured distributions. Importantly, our results do not require specialized architectures tailored to particular latent structures, and instead rely on the low-index structure of the Barron space to adapt to the underlying distribution.
Abstract:A driving force behind the diverse applicability of modern machine learning is the ability to extract meaningful features across many sources. However, many practical domains involve data that are non-identically distributed across sources, and statistically dependent within its source, violating vital assumptions in existing theoretical studies. Toward addressing these issues, we establish statistical guarantees for learning general $\textit{nonlinear}$ representations from multiple data sources that admit different input distributions and possibly dependent data. Specifically, we study the sample-complexity of learning $T+1$ functions $f_\star^{(t)} \circ g_\star$ from a function class $\mathcal F \times \mathcal G$, where $f_\star^{(t)}$ are task specific linear functions and $g_\star$ is a shared nonlinear representation. A representation $\hat g$ is estimated using $N$ samples from each of $T$ source tasks, and a fine-tuning function $\hat f^{(0)}$ is fit using $N'$ samples from a target task passed through $\hat g$. We show that when $N \gtrsim C_{\mathrm{dep}} (\mathrm{dim}(\mathcal F) + \mathrm{C}(\mathcal G)/T)$, the excess risk of $\hat f^{(0)} \circ \hat g$ on the target task decays as $\nu_{\mathrm{div}} \big(\frac{\mathrm{dim}(\mathcal F)}{N'} + \frac{\mathrm{C}(\mathcal G)}{N T} \big)$, where $C_{\mathrm{dep}}$ denotes the effect of data dependency, $\nu_{\mathrm{div}}$ denotes an (estimatable) measure of $\textit{task-diversity}$ between the source and target tasks, and $\mathrm C(\mathcal G)$ denotes the complexity of the representation class $\mathcal G$. In particular, our analysis reveals: as the number of tasks $T$ increases, both the sample requirement and risk bound converge to that of $r$-dimensional regression as if $g_\star$ had been given, and the effect of dependency only enters the sample requirement, leaving the risk bound matching the iid setting.
Abstract:In this note, we give a short information-theoretic proof of the consistency of the Gaussian maximum likelihood estimator in linear auto-regressive models. Our proof yields nearly optimal non-asymptotic rates for parameter recovery and works without any invocation of stability in the case of finite hypothesis classes.
Abstract:Model-based reinforcement learning is an effective approach for controlling an unknown system. It is based on a longstanding pipeline familiar to the control community in which one performs experiments on the environment to collect a dataset, uses the resulting dataset to identify a model of the system, and finally performs control synthesis using the identified model. As interacting with the system may be costly and time consuming, targeted exploration is crucial for developing an effective control-oriented model with minimal experimentation. Motivated by this challenge, recent work has begun to study finite sample data requirements and sample efficient algorithms for the problem of optimal exploration in model-based reinforcement learning. However, existing theory and algorithms are limited to model classes which are linear in the parameters. Our work instead focuses on models with nonlinear parameter dependencies, and presents the first finite sample analysis of an active learning algorithm suitable for a general class of nonlinear dynamics. In certain settings, the excess control cost of our algorithm achieves the optimal rate, up to logarithmic factors. We validate our approach in simulation, showcasing the advantage of active, control-oriented exploration for controlling nonlinear systems.
Abstract:We study the quadratic prediction error method -- i.e., nonlinear least squares -- for a class of time-varying parametric predictor models satisfying a certain identifiability condition. While this method is known to asymptotically achieve the optimal rate for a wide range of problems, there have been no non-asymptotic results matching these optimal rates outside of a select few, typically linear, model classes. By leveraging modern tools from learning with dependent data, we provide the first rate-optimal non-asymptotic analysis of this method for our more general setting of nonlinearly parametrized model classes. Moreover, we show that our results can be applied to a particular class of identifiable AutoRegressive Moving Average (ARMA) models, resulting in the first optimal non-asymptotic rates for identification of ARMA models.
Abstract:In this work, we study statistical learning with dependent ($\beta$-mixing) data and square loss in a hypothesis class $\mathscr{F}\subset L_{\Psi_p}$ where $\Psi_p$ is the norm $\|f\|_{\Psi_p} \triangleq \sup_{m\geq 1} m^{-1/p} \|f\|_{L^m} $ for some $p\in [2,\infty]$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent data. Absent any realizability assumption, typical non-asymptotic results exhibit variance proxies that are deflated \emph{multiplicatively} by the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$ -- that is, $\mathscr{F}$ is a weakly sub-Gaussian class: $\|f\|_{\Psi_p} \lesssim \|f\|_{L^2}^\eta$ for some $\eta\in (0,1]$ -- the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. Our result holds whether the problem is realizable or not and we refer to this as a \emph{near mixing-free rate}, since direct dependence on mixing is relegated to an additive higher order term. We arrive at our result by combining the above notion of a weakly sub-Gaussian class with mixed tail generic chaining. This combination allows us to compute sharp, instance-optimal rates for a wide range of problems. %Our approach, reliant on mixed tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples that satisfy our framework include sub-Gaussian linear regression, more general smoothly parameterized function classes, finite hypothesis classes, and bounded smoothness classes.
Abstract:This tutorial serves as an introduction to recently developed non-asymptotic methods in the theory of -- mainly linear -- system identification. We emphasize tools we deem particularly useful for a range of problems in this domain, such as the covering technique, the Hanson-Wright Inequality and the method of self-normalized martingales. We then employ these tools to give streamlined proofs of the performance of various least-squares based estimators for identifying the parameters in autoregressive models. We conclude by sketching out how the ideas presented herein can be extended to certain nonlinear identification problems.
Abstract:We derive upper bounds for random design linear regression with dependent ($\beta$-mixing) data absent any realizability assumptions. In contrast to the strictly realizable martingale noise regime, no sharp instance-optimal non-asymptotics are available in the literature. Up to constant factors, our analysis correctly recovers the variance term predicted by the Central Limit Theorem -- the noise level of the problem -- and thus exhibits graceful degradation as we introduce misspecification. Past a burn-in, our result is sharp in the moderate deviations regime, and in particular does not inflate the leading order term by mixing time factors.
Abstract:We present a simple proof for bounding the smallest eigenvalue of the empirical covariance in a causal Gaussian process. Along the way, we establish a one-sided tail inequality for Gaussian quadratic forms using a causal decomposition. Our proof only uses elementary facts about the Gaussian distribution and the union bound.
Abstract:This tutorial survey provides an overview of recent non-asymptotic advances in statistical learning theory as relevant to control and system identification. While there has been substantial progress across all areas of control, the theory is most well-developed when it comes to linear system identification and learning for the linear quadratic regulator, which are the focus of this manuscript. From a theoretical perspective, much of the labor underlying these advances has been in adapting tools from modern high-dimensional statistics and learning theory. While highly relevant to control theorists interested in integrating tools from machine learning, the foundational material has not always been easily accessible. To remedy this, we provide a self-contained presentation of the relevant material, outlining all the key ideas and the technical machinery that underpin recent results. We also present a number of open problems and future directions.