Abstract:van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense. In this note we study the accuracy of binary classifiers obtained by minimizing the unhinged loss, and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a binary classifier with accuracy no better than random guessing.
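The phenomenon in this abstract can be illustrated with a small sketch. It assumes the unhinged loss is $\ell(y,v) = 1 - yv$ and a linear scorer $v = \langle w, x\rangle$ with no bias term and $\|w\| \le 1$, in which case the empirical minimizer points along the average of $y \cdot x$. The four-point dataset and all function names are hypothetical and are not taken from the paper.

```python
# Toy illustration (hypothetical data, not from the paper).
# Assumption: unhinged loss l(y, v) = 1 - y*v with a linear scorer v = <w, x>
# and the constraint ||w|| <= 1. The empirical risk 1 - <w, mean(y*x)> is then
# minimized by a w pointing along mean(y*x).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unhinged_minimizer(points, labels):
    """Return the direction mean(y * x), unnormalized (scaling does not change signs)."""
    n = len(points)
    dim = len(points[0])
    return [sum(y * x[j] for x, y in zip(points, labels)) / n for j in range(dim)]

def accuracy(w, points, labels):
    """Fraction of points whose label agrees with sign(<w, x>)."""
    correct = sum(1 for x, y in zip(points, labels) if y * dot(w, x) > 0)
    return correct / len(points)

# Four points that are linearly separable by w_star = (0, 1) with margin 0.1.
points = [(3.0, 0.1), (-3.0, -0.1), (-1.0, 0.1), (1.0, -0.1)]
labels = [+1, -1, +1, -1]

w_star = [0.0, 1.0]
w_unhinged = unhinged_minimizer(points, labels)

acc_star = accuracy(w_star, points, labels)
acc_unhinged = accuracy(w_unhinged, points, labels)
print("separator accuracy:", acc_star)
print("unhinged direction:", w_unhinged, "accuracy:", acc_unhinged)
```

In this toy case the large first coordinate dominates the average of $y \cdot x$, so the unhinged direction misclassifies the two points with a small first coordinate even though a perfect separator exists.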
Abstract:This paper considers the following question: how well can depth-two ReLU networks with randomly initialized bottom-level weights represent smooth functions? We give near-matching upper- and lower-bounds for $L_2$-approximation in terms of the Lipschitz constant, the desired accuracy, and the dimension of the problem, as well as similar results in terms of Sobolev norms. Our positive results employ tools from harmonic analysis and ridgelet representation theory, while our lower-bounds are based on (robust versions of) dimensionality arguments.
Abstract:Several recent works have considered the \emph{trace reconstruction problem}, in which an unknown source string $x\in\{0,1\}^n$ is transmitted through a probabilistic channel which may randomly delete coordinates or insert random bits, resulting in a \emph{trace} of $x$. The goal is to reconstruct the original string~$x$ from independent traces of $x$. While the best algorithms known for worst-case strings use $\exp(O(n^{1/3}))$ traces \cite{DOS17,NazarovPeres17}, highly efficient algorithms are known \cite{PZ17,HPP18} for the \emph{average-case} version, in which $x$ is uniformly random. We consider a generalization of this average-case trace reconstruction problem, which we call \emph{average-case population recovery in the presence of insertions and deletions}. In this problem, there is an unknown distribution $\cal{D}$ over $s$ unknown source strings $x^1,\dots,x^s \in \{0,1\}^n$, and each sample is independently generated by drawing some $x^i$ from $\cal{D}$ and returning an independent trace of $x^i$. Building on \cite{PZ17} and \cite{HPP18}, we give an efficient algorithm for this problem. For any support size $s \leq \smash{\exp(\Theta(n^{1/3}))}$, for a $1-o(1)$ fraction of all $s$-element support sets $\{x^1,\dots,x^s\} \subset \{0,1\}^n$, for every distribution $\cal{D}$ supported on $\{x^1,\dots,x^s\}$, our algorithm efficiently recovers ${\cal D}$ up to total variation distance $\epsilon$ with high probability, given access to independent traces of independent draws from $\cal{D}$. The algorithm runs in time poly$(n,s,1/\epsilon)$ and its sample complexity is poly$(s,1/\epsilon,\exp(\log^{1/3}n)).$ This polynomial dependence on the support size $s$ is in sharp contrast with the \emph{worst-case} version (when $x^1,\dots,x^s$ may be any strings in $\{0,1\}^n$), in which the sample complexity of the most efficient known algorithm \cite{BCFSS19} is doubly exponential in $s$.
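As a rough illustration of the sampling model in this abstract, the sketch below generates traces and population samples. It assumes a deletion-only channel in which each coordinate is dropped independently with probability `delta`. The abstract also allows insertions, which are omitted here. The function names and parameter values are hypothetical, and this is the data-generating process only, not the recovery algorithm.

```python
import random

def deletion_trace(x, delta, rng):
    """Pass bit string x through a deletion channel: each coordinate is
    dropped independently with probability delta (an assumed channel model)."""
    return "".join(bit for bit in x if rng.random() >= delta)

def population_sample(strings, weights, delta, rng):
    """Draw a source string x^i from the distribution D given by `weights`,
    then return one independent trace of it."""
    x = rng.choices(strings, weights=weights, k=1)[0]
    return deletion_trace(x, delta, rng)

def is_subsequence(t, x):
    """Check that t can be obtained from x by deleting coordinates."""
    it = iter(x)
    return all(ch in it for ch in t)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n, s = 20, 3
# s uniformly random source strings (the average-case setting)
strings = ["".join(rng.choice("01") for _ in range(n)) for _ in range(s)]
weights = [0.5, 0.3, 0.2]  # hypothetical distribution D over the support

samples = [population_sample(strings, weights, 0.2, rng) for _ in range(5)]
for t in samples:
    print(t)
```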
Abstract:What kinds of functions are learnable from their satisfying assignments? Motivated by this simple question, we extend the framework of De, Diakonikolas, and Servedio [DDS15], which studied the learnability of probability distributions over $\{0,1\}^n$ defined by the set of satisfying assignments to "low-complexity" Boolean functions, to Boolean-valued functions defined over continuous domains. In our learning scenario there is a known "background distribution" $\mathcal{D}$ over $\mathbb{R}^n$ (such as a known normal distribution or a known log-concave distribution) and the learner is given i.i.d. samples drawn from a target distribution $\mathcal{D}_f$, where $\mathcal{D}_f$ is $\mathcal{D}$ restricted to the satisfying assignments of an unknown low-complexity Boolean-valued function $f$. The problem is to learn an approximation $\mathcal{D}'$ of the target distribution $\mathcal{D}_f$ which has small error as measured in total variation distance. We give a range of efficient algorithms and hardness results for this problem, focusing on the case when $f$ is a low-degree polynomial threshold function (PTF). When the background distribution $\mathcal{D}$ is log-concave, we show that this learning problem is efficiently solvable for degree-1 PTFs (i.e.,~linear threshold functions) but not for degree-2 PTFs. In contrast, when $\mathcal{D}$ is a normal distribution, we show that this learning problem is efficiently solvable for degree-2 PTFs but not for degree-4 PTFs. Our hardness results rely on standard assumptions about secure signature schemes.
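The sampling model in this abstract can be sketched as follows. The sketch assumes the background distribution $\mathcal{D}$ is the standard normal on $\mathbb{R}^n$ and that $f$ is a linear threshold function, and it draws from $\mathcal{D}_f$ by rejection sampling. It shows only how the learner's samples arise, not any of the learning algorithms. All names and parameter values are hypothetical.

```python
import random

def ltf(w, theta):
    """Degree-1 PTF (linear threshold function): f(x) is True iff <w, x> >= theta."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) >= theta

def sample_restricted(f, n, num_samples, rng, max_tries=100000):
    """Draw samples from D_f, i.e. D conditioned on f(x) being satisfied,
    by rejection. Assumption: D is the standard normal N(0, I_n)."""
    out = []
    tries = 0
    while len(out) < num_samples and tries < max_tries:
        tries += 1
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if f(x):  # keep only satisfying assignments of f
            out.append(x)
    return out

rng = random.Random(1)  # fixed seed for reproducibility
f = ltf([1.0, 1.0, 0.0], 0.5)  # hypothetical target function
samples = sample_restricted(f, n=3, num_samples=50, rng=rng)
print(len(samples), "samples, all satisfying f:", all(f(x) for x in samples))
```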
Abstract:We study density estimation for classes of shift-invariant distributions over $\mathbb{R}^d$. A multidimensional distribution is "shift-invariant" if, roughly speaking, it is close in total variation distance to a small shift of it in any direction. Shift-invariance relaxes smoothness assumptions commonly used in non-parametric density estimation to allow jump discontinuities. The different classes of distributions that we consider correspond to different rates of tail decay. For each such class we give an efficient algorithm that learns any distribution in the class from independent samples with respect to total variation distance. As a special case of our general result, we show that $d$-dimensional shift-invariant distributions which satisfy an exponential tail bound can be learned to total variation distance error $\epsilon$ using $\tilde{O}_d(1/ \epsilon^{d+2})$ examples and $\tilde{O}_d(1/ \epsilon^{2d+2})$ time. This implies that, for constant $d$, multivariate log-concave distributions can be learned in $\tilde{O}_d(1/\epsilon^{2d+2})$ time using $\tilde{O}_d(1/\epsilon^{d+2})$ samples, answering a question of [Diakonikolas, Kane and Stewart, 2016]. All of our results extend to a model of noise-tolerant density estimation using Huber's contamination model, in which the target distribution to be learned is a $(1-\epsilon,\epsilon)$ mixture of some unknown distribution in the class with some other arbitrary and unknown distribution, and the learning algorithm must output a hypothesis distribution with total variation distance error $O(\epsilon)$ from the target distribution. We show that our general results are close to best possible by proving a simple $\Omega\left(1/\epsilon^d\right)$ information-theoretic lower bound on sample complexity even for learning bounded distributions that are shift-invariant.
Abstract:We consider a basic problem in unsupervised learning: learning an unknown \emph{Poisson Binomial Distribution}. A Poisson Binomial Distribution (PBD) over $\{0,1,\dots,n\}$ is the distribution of a sum of $n$ independent Bernoulli random variables which may have arbitrary, potentially non-equal, expectations. These distributions were first studied by S. Poisson in 1837 \cite{Poisson:37} and are a natural $n$-parameter generalization of the familiar Binomial Distribution. Surprisingly, prior to our work this basic learning problem was poorly understood, and known results for it were far from optimal. We essentially settle the complexity of the learning problem for this basic class of distributions. As our first main result we give a highly efficient algorithm which learns to $\eps$-accuracy (with respect to the total variation distance) using $\tilde{O}(1/\eps^3)$ samples \emph{independent of $n$}. The running time of the algorithm is \emph{quasilinear} in the size of its input data, i.e., $\tilde{O}(\log(n)/\eps^3)$ bit-operations. (Observe that each draw from the distribution is a $\log(n)$-bit string.) Our second main result is a {\em proper} learning algorithm that learns to $\eps$-accuracy using $\tilde{O}(1/\eps^2)$ samples, and runs in time $(1/\eps)^{\poly (\log (1/\eps))} \cdot \log n$. This is nearly optimal, since any algorithm for this problem must use $\Omega(1/\eps^2)$ samples. We also give positive and negative results for some extensions of this learning problem to weighted sums of independent Bernoulli random variables.
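To make the definition of a Poisson Binomial Distribution concrete, the following sketch computes its probability mass function exactly by adding one Bernoulli variable at a time. It illustrates the object being learned rather than either learning algorithm. The function name and the example parameters are hypothetical.

```python
def pbd_pmf(ps):
    """PMF of X_1 + ... + X_n where X_i ~ Bernoulli(ps[i]) independently.
    Returns a list q with q[k] = Pr[sum = k] for k = 0..n."""
    pmf = [1.0]  # the empty sum equals 0 with probability 1
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # the new variable is 0
            nxt[k + 1] += mass * p       # the new variable is 1
        pmf = nxt
    return pmf

# Equal expectations recover the familiar Binomial distribution.
binom = pbd_pmf([0.5, 0.5])
# Non-equal expectations give a general PBD over {0, 1, 2, 3}.
general = pbd_pmf([0.1, 0.5, 0.9])
print(binom)
print(general, "total mass:", sum(general))
```

This runs in $O(n^2)$ arithmetic operations for $n$ variables.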
Abstract:Let $p$ be an unknown and arbitrary probability distribution over $[0,1)$. We consider the problem of {\em density estimation}, in which a learning algorithm is given i.i.d. draws from $p$ and must (with high probability) output a hypothesis distribution that is close to $p$. The main contribution of this paper is a highly efficient density estimation algorithm for learning using a variable-width histogram, i.e., a hypothesis distribution with a piecewise constant probability density function. In more detail, for any $k$ and $\epsilon$, we give an algorithm that makes $\tilde{O}(k/\epsilon^2)$ draws from $p$, runs in $\tilde{O}(k/\epsilon^2)$ time, and outputs a hypothesis distribution $h$ that is piecewise constant with $O(k \log^2(1/\epsilon))$ pieces. With high probability the hypothesis $h$ satisfies $d_{\mathrm{TV}}(p,h) \leq C \cdot \mathrm{opt}_k(p) + \epsilon$, where $d_{\mathrm{TV}}$ denotes the total variation distance (statistical distance), $C$ is a universal constant, and $\mathrm{opt}_k(p)$ is the smallest total variation distance between $p$ and any $k$-piecewise constant distribution. The sample size and running time of our algorithm are optimal up to logarithmic factors. The "approximation factor" $C$ in our result is inherent in the problem, as we prove that no algorithm with sample size bounded in terms of $k$ and $\epsilon$ can achieve $C<2$ regardless of what kind of hypothesis distribution it uses.
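The guarantee in this abstract is stated in terms of the total variation distance between piecewise constant densities on $[0,1)$. As a minimal sketch, the function below computes that distance exactly for two variable-width histograms. The representation (a list of breakpoints from 0 to 1 plus one height per piece) is an assumption made for illustration, and the example densities are hypothetical.

```python
def tv_distance(breaks_p, heights_p, breaks_q, heights_q):
    """Total variation distance between two piecewise constant densities on [0, 1).
    breaks_* are increasing lists starting at 0.0 and ending at 1.0;
    heights_*[i] is the density value on [breaks_*[i], breaks_*[i + 1])."""
    cuts = sorted(set(breaks_p) | set(breaks_q))  # common refinement
    total = 0.0
    i = j = 0
    for left, right in zip(cuts, cuts[1:]):
        # advance to the piece of each density that contains `left`
        while breaks_p[i + 1] <= left:
            i += 1
        while breaks_q[j + 1] <= left:
            j += 1
        total += abs(heights_p[i] - heights_q[j]) * (right - left)
    return 0.5 * total  # d_TV is half the L1 distance between densities

# A 2-piece density versus the uniform density on [0, 1).
two_piece = ([0.0, 0.5, 1.0], [1.5, 0.5])
uniform = ([0.0, 1.0], [1.0])
d = tv_distance(*two_piece, *uniform)
print(d)
```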
Abstract:Monotone Boolean functions, and the monotone Boolean circuits that compute them, have been intensively studied in complexity theory. In this paper we study the structure of Boolean functions in terms of the minimum number of negations in any circuit computing them, a complexity measure that interpolates between monotone functions and the class of all functions. We study this generalization of monotonicity from the vantage point of learning theory, giving near-matching upper and lower bounds on the uniform-distribution learnability of circuits in terms of the number of negations they contain. Our upper bounds are based on a new structural characterization of negation-limited circuits that extends a classical result of A. A. Markov. Our lower bounds, which employ Fourier-analytic tools from hardness amplification, give new results even for circuits with no negations (i.e. monotone functions).
Abstract:A $k$-modal probability distribution over the discrete domain $\{1,...,n\}$ is one whose histogram has at most $k$ "peaks" and "valleys." Such distributions are natural generalizations of monotone ($k=0$) and unimodal ($k=1$) probability distributions, which have been intensively studied in probability theory and statistics. In this paper we consider the problem of \emph{learning} (i.e., performing density estimation of) an unknown $k$-modal distribution with respect to the $L_1$ distance. The learning algorithm is given access to independent samples drawn from an unknown $k$-modal distribution $p$, and it must output a hypothesis distribution $\widehat{p}$ such that with high probability the total variation distance between $p$ and $\widehat{p}$ is at most $\epsilon.$ Our main goal is to obtain \emph{computationally efficient} algorithms for this problem that use (close to) an information-theoretically optimal number of samples. We give an efficient algorithm for this problem that runs in time $\mathrm{poly}(k,\log(n),1/\epsilon)$. For $k \leq \tilde{O}(\log n)$, the number of samples used by our algorithm is very close (within an $\tilde{O}(\log(1/\epsilon))$ factor) to being information-theoretically optimal. Prior to this work computationally efficient algorithms were known only for the cases $k=0,1$ \cite{Birge:87b,Birge:97}. A novel feature of our approach is that our learning algorithm crucially uses a new algorithm for \emph{property testing of probability distributions} as a key subroutine. The learning algorithm uses the property tester to efficiently decompose the $k$-modal distribution into $k$ (near-)monotone distributions, which are easier to learn.
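The definition of $k$-modality can be made concrete with a short check that counts how often a histogram switches between rising and falling. The sketch treats each such switch as one peak or valley and ignores flat stretches, which is an assumption about how ties are handled. The function name and example histograms are hypothetical.

```python
def count_direction_changes(pmf):
    """Number of times the sequence switches between increasing and decreasing,
    ignoring flat stretches. A histogram is k-modal if this count is at most k."""
    changes = 0
    last = 0  # +1 while rising, -1 while falling, 0 before any strict step
    for a, b in zip(pmf, pmf[1:]):
        if a == b:
            continue  # plateaus do not create peaks or valleys
        step = 1 if b > a else -1
        if last != 0 and step != last:
            changes += 1
        last = step
    return changes

monotone = [0.4, 0.3, 0.2, 0.1]
unimodal = [0.1, 0.3, 0.4, 0.2]
two_peaks = [0.1, 0.3, 0.1, 0.4, 0.1]

print(count_direction_changes(monotone))
print(count_direction_changes(unimodal))
print(count_direction_changes(two_peaks))
```

Under this convention the last example has two peaks and one valley, so it is 3-modal but not 2-modal.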
Abstract:We give a highly efficient "semi-agnostic" algorithm for learning univariate probability distributions that are well approximated by piecewise polynomial density functions. Let $p$ be an arbitrary distribution over an interval $I$ which is $\tau$-close (in total variation distance) to an unknown probability distribution $q$ that is defined by an unknown partition of $I$ into $t$ intervals and $t$ unknown degree-$d$ polynomials specifying $q$ over each of the intervals. We give an algorithm that draws $\tilde{O}(t(d+1)/\eps^2)$ samples from $p$, runs in time $\poly(t,d,1/\eps)$, and with high probability outputs a piecewise polynomial hypothesis distribution $h$ that is $(O(\tau)+\eps)$-close (in total variation distance) to $p$. This sample complexity is essentially optimal; we show that even for $\tau=0$, any algorithm that learns an unknown $t$-piecewise degree-$d$ probability distribution over $I$ to accuracy $\eps$ must use $\Omega({\frac {t(d+1)} {\poly(1 + \log(d+1))}} \cdot {\frac 1 {\eps^2}})$ samples from the distribution, regardless of its running time. Our algorithm combines tools from approximation theory, uniform convergence, linear programming, and dynamic programming. We apply this general algorithm to obtain a wide range of results for many natural problems in density estimation over both continuous and discrete domains. These include state-of-the-art results for learning mixtures of log-concave distributions; mixtures of $t$-modal distributions; mixtures of Monotone Hazard Rate distributions; mixtures of Poisson Binomial Distributions; mixtures of Gaussians; and mixtures of $k$-monotone densities. Our general technique yields computationally efficient algorithms for all these problems, in many cases with provably optimal sample complexities (up to logarithmic factors) in all parameters.