Abstract: In this paper we analyze the behaviour of stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights by minimizing non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Although Fokker-Planck equations have a long history and an extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes. In the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds on it. Finally, we address the asymptotic convergence of SGD for a non-convex cost function and a degenerate diffusion matrix, a setting in which the standard approaches do not apply and new techniques are required. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions in machine learning: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?
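The escape question raised in this abstract can be illustrated numerically. The following is a minimal sketch, not the paper's method: it assumes a hypothetical one-dimensional double-well loss f(x) = (x^2 - 1)^2 / 4, models the gradient noise as Gaussian with constant standard deviation `sigma`, and estimates by Monte Carlo the mean time a noisy gradient iteration started at the minimum x = -1 needs to reach the barrier at x = 0. The function names, the loss, and all parameter values are illustrative assumptions.

```python
import random


def grad_f(x):
    # Gradient of the assumed double-well loss f(x) = (x^2 - 1)^2 / 4,
    # which has minima at x = -1 and x = +1 and a barrier at x = 0.
    return x * (x * x - 1.0)


def exit_time(x0, eta, sigma, barrier=0.0, max_steps=200_000, rng=random):
    """Number of noisy gradient steps until x first reaches `barrier`.

    Returns `max_steps` if the barrier is never reached.
    """
    x = x0
    for k in range(1, max_steps + 1):
        # SGD-like update: gradient step plus Gaussian noise scaled by the
        # learning rate (an Euler-Maruyama step of size eta for the SDE
        # dX = -f'(X) dt + sqrt(eta) * sigma dW).
        x = x - eta * grad_f(x) + eta * sigma * rng.gauss(0.0, 1.0)
        if x >= barrier:
            return k
    return max_steps


def mean_exit_time(n_runs, eta, sigma, seed=0):
    """Monte Carlo estimate of the mean exit time, in units t = k * eta."""
    rng = random.Random(seed)
    steps = [exit_time(-1.0, eta, sigma, rng=rng) for _ in range(n_runs)]
    return eta * sum(steps) / n_runs
```

Under these assumptions, calling `mean_exit_time(200, 0.05, 2.0)` and `mean_exit_time(200, 0.05, 4.0)` lets one compare how the noise level affects the escape time. With `sigma = 0` the iterate stays at the minimum and never exits.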
Abstract: In this article, we consider the problem of approximating a finite set of data (usually huge in applications) by invariant subspaces generated by a small set of smooth functions. The invariance is either under translations by a full-rank lattice or under the action of crystallographic groups. Smoothness is ensured by stipulating that the generators belong to a Paley-Wiener space, which is selected in an optimal way based on the characteristics of the given data. To complete our investigation, we analyze the fundamental role played by the lattice in the process of approximation.
Abstract: In this paper we consider the problem of reconstructing an image that is downsampled in the space of its $SE(2)$ wavelet transform, which is motivated by classical models of simple-cell receptive fields and feature preference maps in primary visual cortex. We prove that, whenever the problem is solvable, the reconstruction can be obtained by an elementary project-and-replace iterative scheme based on the reproducing kernel arising from the group structure, and show numerical results on real images.
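The idea of a project-and-replace iteration can be sketched in a finite-dimensional toy setting. This is a generic illustration in the spirit of alternating-projection schemes, not the scheme of the paper: the $SE(2)$ wavelet transform and its reproducing kernel are replaced by an orthogonal projection onto an assumed two-dimensional subspace of R^4, and the names `project` and `project_and_replace` are hypothetical. Each iteration projects the current estimate onto the subspace and then overwrites the entries whose values are known.

```python
def project(x, basis):
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    out = [0.0] * len(x)
    for u in basis:
        coeff = sum(xi * ui for xi, ui in zip(x, u))
        out = [oi + coeff * ui for oi, ui in zip(out, u)]
    return out


def project_and_replace(samples, n, basis, n_iter=60):
    """Recover a length-n vector in span(basis) from known entries.

    `samples` maps index -> known value; unknown entries start at zero.
    """
    x = [samples.get(i, 0.0) for i in range(n)]
    for _ in range(n_iter):
        x = project(x, basis)          # project onto the signal subspace
        for i, value in samples.items():
            x[i] = value               # replace the known samples
    return x


# Toy subspace of R^4 (orthonormal basis) and a signal (4, 2, 4, 2) in it,
# of which only the first two entries are observed.
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]
recovered = project_and_replace({0: 4.0, 1: 2.0}, 4, basis)
```

In this toy case the two observed entries determine the signal uniquely within the subspace, and the error in the unknown entries halves at every iteration, so `recovered` approaches (4, 2, 4, 2).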
Abstract: We construct a set of square matrices whose translates and rotates form a Parseval frame that is optimal for approximating a given dataset of images. Our approach is based on abstract harmonic analysis techniques. Optimality is considered with respect to the quadratic error of approximation of the images in the dataset by their projections onto a linear subspace that is invariant under translations and rotations. In addition, we provide an elementary and fully self-contained proof of optimality, and numerical results on datasets of natural images.
Abstract: Some geometric properties of the wavelet analysis performed by visual neurons are discussed and compared with experimental data. In particular, several relationships between the cortical morphologies and the parametric dependencies of extracted features are formalized and considered from a harmonic analysis point of view.
Abstract: The visual system of many mammals, including humans, is able to integrate the geometric information of visual stimuli and to perform cognitive tasks already at the first stages of cortical processing. This is thought to be the result of a combination of mechanisms, which include feature extraction at the single-cell level and geometric processing by means of cell connectivity. We present a geometric model of such connectivities in the space of detected features associated with spatio-temporal visual stimuli, and show how they can be used to obtain low-level object segmentation. The main idea is to define a spectral clustering procedure with anisotropic affinities over datasets consisting of embeddings of the visual stimuli into higher-dimensional spaces. The neural plausibility of the proposed arguments is discussed.