Abstract: Matrix factorization exploits the idea that, in complex high-dimensional data, the actual signal typically lies in lower-dimensional structures. These lower-dimensional objects provide useful insight, with interpretability favored by sparse structures. Sparsity is, in addition, beneficial in terms of regularization and thus helps avoid over-fitting. By exploiting Bayesian shrinkage priors, we devise a computationally convenient approach for high-dimensional matrix factorization. The dependence between row and column entities is modeled by inducing flexible sparse patterns within factors. The availability of external information is accounted for in such a way that structures are allowed but not imposed. Inspired by boosting algorithms, we pair the proposed approach with a numerical strategy relying on the sequential inclusion and estimation of low-rank contributions, with a data-driven stopping rule. Practical advantages of the proposed approach are demonstrated by means of a simulation study and the analysis of soccer heatmaps obtained from new-generation tracking data.
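To make the boosting-inspired sequential strategy concrete, here is a minimal sketch in Python. It is a plain least-squares analogue rather than the Bayesian shrinkage estimator described in the abstract: the function name `sequential_rank1_factorization`, the `max_rank` cap, and the relative-improvement threshold `tol` are illustrative assumptions, with the data-driven stopping rule reduced to that threshold.

```python
import numpy as np

def sequential_rank1_factorization(Y, max_rank=20, tol=1e-3):
    """Greedy, boosting-style factorization: rank-1 terms are fitted to
    the residual one at a time, stopping when the next term explains a
    negligible share of the total variation.

    This is a least-squares analogue of the sequential strategy in the
    abstract, not the Bayesian shrinkage estimator itself: max_rank and
    tol are illustrative stand-ins for the data-driven stopping rule.
    """
    residual = Y.astype(float).copy()
    total = np.sum(residual ** 2)
    factors = []
    for _ in range(max_rank):
        # Leading rank-1 contribution of the current residual.
        u, s, vt = np.linalg.svd(residual, full_matrices=False)
        layer = s[0] * np.outer(u[:, 0], vt[0, :])
        # Stop when the candidate term explains too little variation.
        if np.sum(layer ** 2) / total < tol:
            break
        factors.append((s[0], u[:, 0], vt[0, :]))
        residual -= layer
    return factors

# Usage: the stopping rule recovers the rank of a noisy low-rank signal.
rng = np.random.default_rng(0)
signal = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))
Y = signal + 0.1 * rng.normal(size=(50, 40))
print(len(sequential_rank1_factorization(Y)))  # typically 3 contributions
```

Each pass extracts the leading rank-1 component of the current residual, mirroring the sequential inclusion and estimation of low-rank contributions described above.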
Abstract: We propose a simple yet powerful framework for modeling integer-valued data. The integer-valued data are modeled by Simultaneously Transforming And Rounding (STAR) a continuous-valued process, where the transformation may be known or learned from the data. Implicitly, STAR formalizes the commonly applied yet incoherent procedure of (i) transforming integer-valued data and subsequently (ii) modeling the transformed data using Gaussian models. Importantly, STAR is well-defined for integer-valued data, which is reflected in predictive accuracy, and is designed to account for zero-inflation, bounded or censored data, and over- or underdispersion. Efficient computation is available via an MCMC algorithm, which provides a mechanism for direct adaptation of successful Bayesian methods for continuous data to the integer-valued data setting. Using the STAR framework, we develop new linear regression models, additive models, and Bayesian Additive Regression Trees (BART) for integer-valued data, which demonstrate substantial improvements in performance relative to existing regression models for a variety of simulated and real datasets.
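A minimal generative sketch of the transform-and-round mechanism follows, assuming a Gaussian linear latent process, a log transformation (so latent values pass through exp), and a floor rounding operator. The helper name `star_sample` and all parameter choices are hypothetical; the framework also covers learned transformations and other rounding operators.

```python
import numpy as np

def star_sample(X, beta, sigma=1.0, rng=None):
    """Simulate integer responses from a minimal STAR generative model:
    a Gaussian linear latent process is transformed and then rounded.

    Assumptions for illustration: g = log (so latent values pass through
    exp) and a floor rounding operator. exp keeps the transformed values
    positive, and floor maps them to the non-negative integers; latent
    values below 1 become zeros, accommodating zero-inflation.
    """
    rng = rng or np.random.default_rng()
    z = X @ beta + sigma * rng.normal(size=X.shape[0])  # continuous latent process
    return np.floor(np.exp(z)).astype(int)              # transform, then round

# Usage: simulate counts from two covariates.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = star_sample(X, beta=np.array([0.8, -0.5]), rng=rng)
print(y.min(), y.max(), round(np.mean(y == 0), 2))  # non-negative, with zeros
```

Because the latent process is an ordinary Gaussian regression, any continuous-data model (linear, additive, tree-based) can sit in place of `X @ beta`, which is the mechanism the abstract exploits to adapt Bayesian methods for continuous data to integer-valued settings.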