Abstract: This paper investigates the accuracy of generative models and the impact of knowledge transfer on their generation precision. Specifically, we examine a generative model for a target task, fine-tuned using a pre-trained model from a source task. Building on the "Shared Embedding" concept, which bridges the source and target tasks, we introduce a novel framework for transfer learning under distribution metrics such as the Kullback-Leibler divergence. This framework underscores the importance of leveraging inherent similarities between diverse tasks despite their distinct data distributions. Our theory suggests that shared structures can improve the generation accuracy for a target task, contingent on the source model's ability to identify those structures and on effective knowledge transfer from source to target learning. To demonstrate the practical utility of this framework, we explore its theoretical implications for two specific generative models: diffusion models and normalizing flows. The results show enhanced performance in both models over their non-transfer counterparts, indicating advancements for diffusion models and providing fresh insights into normalizing flows in transfer and non-transfer settings. These results highlight the significant contribution of knowledge transfer to the generation capabilities of these models.
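As a toy illustration of the role of shared structure under a KL criterion (not the paper's framework or its diffusion/flow models), the numpy sketch below fits a Gaussian generative model to a small target sample, either from scratch or by borrowing a covariance structure estimated on a large source sample, and compares the KL divergence to the true target distribution. The Gaussian form and the shared covariance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
shared_cov = np.diag(np.linspace(1.0, 2.0, d))        # structure shared by both tasks
mu_source, mu_target = np.zeros(d), np.full(d, 0.5)   # tasks differ in their means

X_source = rng.multivariate_normal(mu_source, shared_cov, size=5000)  # abundant source data
X_target = rng.multivariate_normal(mu_target, shared_cov, size=30)    # scarce target data

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ), closed form."""
    k = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

# Non-transfer: estimate mean and covariance from the target sample alone.
kl_no_transfer = gaussian_kl(mu_target, shared_cov,
                             X_target.mean(0), np.cov(X_target, rowvar=False))
# Transfer: borrow the covariance (the shared structure) learned on the source task.
kl_transfer = gaussian_kl(mu_target, shared_cov,
                          X_target.mean(0), np.cov(X_source, rowvar=False))
print(f"KL without transfer: {kl_no_transfer:.3f}, with transfer: {kl_transfer:.3f}")
```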
Abstract: This paper presents FlowSUM, a normalizing flows-based variational encoder-decoder framework for Transformer-based summarization. Our approach tackles two primary challenges in variational summarization: insufficient semantic information in latent representations and posterior collapse during training. To address these challenges, we employ normalizing flows to enable flexible latent posterior modeling, and we propose a controlled alternate aggressive training (CAAT) strategy with an improved gate mechanism. Experimental results show that FlowSUM significantly enhances the quality of generated summaries and unleashes the potential for knowledge distillation with minimal impact on inference time. Furthermore, we investigate the issue of posterior collapse in normalizing flows and analyze how the summary quality is affected by the training strategy, gate initialization, and the type and number of normalizing flows used, offering valuable insights for future research.
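As a hedged sketch of how normalizing flows enrich a variational posterior (this is not FlowSUM's architecture; the RealNVP-style coupling layer, layer sizes, and activations are assumptions), the PyTorch snippet below pushes a reparameterized Gaussian sample through a small flow stack while accumulating the log-determinants needed in the variational objective.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer: transforms half of z conditioned on the other half."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(z1).chunk(2, dim=-1)
        s = torch.tanh(s)                      # keep scales well-behaved
        z2 = z2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)                # Jacobian contribution of this layer
        return torch.cat([z1, z2], dim=-1), log_det

# Usage: transform a base Gaussian posterior sample through a stack of flows.
# (A full implementation would permute or alternate the halves between layers.)
dim, flows = 16, [AffineCoupling(16) for _ in range(4)]
z = torch.randn(8, dim)                        # reparameterized sample from the base posterior
log_det_total = torch.zeros(8)
for f in flows:
    z, ld = f(z)
    log_det_total += ld                        # enters the KL term of the objective
```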
Abstract: Synthetic data generation, a cornerstone of Generative Artificial Intelligence, signifies a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data gains prominence, questions arise concerning the accuracy of statistical methods when applied to synthetic data compared to raw data. In this article, we introduce the Synthetic Data Generation for Analytics framework. This framework applies statistical methods to high-fidelity synthetic data generated by advanced models such as tabular diffusion and Generative Pre-trained Transformer models. These models, trained on raw data, are further enhanced with insights from pertinent studies. A significant discovery within this framework is the generational effect: the error of a statistical method on synthetic data initially diminishes with added synthetic data but may eventually increase or plateau. This phenomenon, rooted in the complexities of replicating raw data distributions, highlights a "reflection point", an optimal threshold for the size of the synthetic data determined by specific error metrics. Through three illustrative case studies (sentiment analysis of texts, predictive modeling of structured data, and inference in tabular data), we demonstrate the effectiveness of this framework over traditional approaches. We underline its potential to amplify various statistical methods, including gradient boosting for prediction and hypothesis testing, thereby underscoring the transformative potential of synthetic data generation in data science.
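The generational effect can be illustrated with a toy numpy experiment in which a simple fitted Gaussian stands in for the tabular diffusion or GPT-style generators used in the framework; this is an assumption-laden sketch, not the article's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 2.0
raw = rng.normal(true_mean, 1.0, size=200)             # raw data

# Fitted generator: slightly misspecified because it only sees 200 points.
gen_mean, gen_std = raw.mean(), raw.std()

for m in [50, 200, 1000, 5000, 20000, 100000]:
    synthetic = rng.normal(gen_mean, gen_std, size=m)   # synthetic sample of size m
    err = abs(synthetic.mean() - true_mean)             # error of the mean estimate
    print(f"synthetic size {m:>6d}: error {err:.4f}")

# Beyond a certain m (a "reflection point" in this toy setting), more synthetic
# data no longer helps: the error is dominated by |gen_mean - true_mean|,
# i.e., the generator's mismatch with the raw-data distribution, not Monte Carlo noise.
```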
Abstract: This article presents a novel method for causal discovery with generalized structural equation models suited for analyzing diverse types of outcomes, including discrete, continuous, and mixed data. Causal discovery often faces challenges due to unmeasured confounders that hinder the identification of causal relationships. The proposed approach addresses this issue by developing two peeling algorithms (bottom-up and top-down) to ascertain causal relationships and valid instruments. This approach first reconstructs a super-graph to represent ancestral relationships between variables, using a peeling algorithm based on nodewise GLM regressions that exploit relationships between primary and instrumental variables. Then, it estimates parent-child effects from the ancestral relationships using another peeling algorithm while deconfounding a child's model with information borrowed from its parents' models. The article offers a theoretical analysis of the proposed approach, which establishes conditions for model identifiability and provides statistical guarantees for accurately discovering parent-child relationships via the peeling algorithms. Furthermore, the article presents numerical experiments showcasing the effectiveness of our approach in comparison to state-of-the-art structure learning methods without confounders. Lastly, it demonstrates an application to Alzheimer's disease (AD), highlighting the utility of the method in constructing gene-to-gene and gene-to-disease regulatory networks involving Single Nucleotide Polymorphisms (SNPs) for healthy and AD subjects.
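As a minimal sketch of the nodewise GLM regressions that the peeling algorithms build on (not the peeling algorithms themselves; the simulated variables, the chosen families, and any screening rule are illustrative assumptions), the snippet below fits a Gaussian GLM for a continuous node and a logistic GLM for a binary node using a candidate instrument.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
snp = rng.binomial(2, 0.3, size=n)                     # instrumental variable (e.g., a SNP)
gene1 = 0.8 * snp + rng.normal(size=n)                 # continuous primary variable
disease = rng.binomial(1, 1 / (1 + np.exp(-(0.9 * gene1 - 1.0))))  # binary outcome

# Continuous node: Gaussian GLM of gene1 on its candidate instrument.
fit_gene = sm.GLM(gene1, sm.add_constant(snp), family=sm.families.Gaussian()).fit()

# Binary node: logistic GLM of the disease indicator on gene1 and the SNP.
X = sm.add_constant(np.column_stack([gene1, snp]))
fit_disease = sm.GLM(disease, X, family=sm.families.Binomial()).fit()

print(fit_gene.params, fit_disease.params)
# A peeling step would use such nodewise fits (e.g., which instruments remain
# predictive for which nodes) to decide which variables to peel off when
# reconstructing the ancestral super-graph.
```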
Abstract: This paper introduces a novel generator called Perturbation-Assisted Sample Synthesis (PASS), designed for drawing reliable conclusions from complex data, especially when using advanced modeling techniques like deep neural networks. PASS utilizes perturbation to generate synthetic data that closely mirrors the distribution of raw data, encompassing numerical and unstructured data types such as gene expression, images, and text. By estimating the data-generating distribution and leveraging large pre-trained generative models, PASS enhances estimation accuracy, providing an estimated distribution of any statistic through Monte Carlo experiments. Building on PASS, we propose a generative inference framework called Perturbation-Assisted Inference (PAI), which offers a statistical guarantee of validity. In pivotal inference, PAI enables accurate conclusions without knowing the distribution of a pivotal quantity, as in simulations, even with limited data. In non-pivotal situations, we train PASS on an independent holdout sample, yielding credible conclusions. To showcase PAI's capability in tackling complex problems, we highlight its applications in three domains: image synthesis inference, sentiment word inference, and multimodal inference via stable diffusion.
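A minimal sketch of the Monte Carlo step, in which a simple fitted Gaussian stands in for PASS and its pre-trained generative models: the sampling distribution of a statistic is approximated by recomputing it on repeated synthetic samples. The generator, the chosen statistic, and the interval construction below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
raw = rng.normal(loc=1.5, scale=2.0, size=100)         # observed data

mu_hat, sigma_hat = raw.mean(), raw.std(ddof=1)        # estimated data-generating distribution

def statistic(x):
    return np.median(x)                                # any statistic of interest

# Monte Carlo: draw synthetic datasets of the same size and recompute the statistic.
draws = np.array([statistic(rng.normal(mu_hat, sigma_hat, size=raw.size))
                  for _ in range(2000)])
lo, hi = np.percentile(draws, [2.5, 97.5])
print(f"median estimate {statistic(raw):.3f}, approximate 95% interval ({lo:.3f}, {hi:.3f})")
```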
Abstract: In explainable artificial intelligence, discriminative feature localization is critical to reveal a blackbox model's decision-making process from raw data to prediction. In this article, we use two real datasets, the MNIST handwritten digits and the MIT-BIH Electrocardiogram (ECG) signals, to motivate key characteristics of discriminative features, namely adaptiveness, predictive importance, and effectiveness. Then, we develop a localization framework based on adversarial attacks to effectively localize discriminative features. In contrast to existing heuristic methods, we also provide a statistically guaranteed interpretability of the localized features by measuring a generalized partial $R^2$. We apply the proposed method to the MNIST and MIT-BIH datasets with a convolutional auto-encoder. For MNIST, the compact image regions localized by the proposed method are visually appealing. For MIT-BIH, the identified ECG features are biologically plausible and consistent with cardiac electrophysiological principles, while locating subtle anomalies in a QRS complex that may not be discernible by the naked eye. Overall, the proposed method compares favorably with state-of-the-art competitors. Accompanying this paper is a Python library, dnn-locate (https://dnn-locate.readthedocs.io/en/latest/), that implements the proposed approach.
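The idea of quantifying how much localized features explain a model's predictions can be illustrated with a toy sklearn example that compares the model's loss before and after masking the flagged features; the ratio computed below is only loosely analogous to the paper's generalized partial $R^2$, whose exact definition and statistical guarantees differ and are given in the article and the dnn-locate documentation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# shuffle=False keeps the informative features in the first columns.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=3,
                           shuffle=False, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

localized = [0, 1, 2]                                   # features flagged as discriminative
X_masked = X.copy()
X_masked[:, localized] = X[:, localized].mean(axis=0)   # remove their information

loss_full = log_loss(y, model.predict_proba(X))
loss_masked = log_loss(y, model.predict_proba(X_masked))
importance = (loss_masked - loss_full) / loss_masked    # share of loss explained by the features
print(f"loss increase when masked: {loss_masked - loss_full:.3f}, ratio: {importance:.3f}")
```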
Abstract: Differential privacy is becoming a gold standard for protecting the privacy of publicly shared data. It has been widely used in social science, data science, public health, information technology, and the U.S. decennial census. Nevertheless, to guarantee differential privacy, existing methods may unavoidably alter the conclusion of the original data analysis, as privatization often changes the sample distribution. This phenomenon is known as the trade-off between privacy protection and statistical accuracy. In this work, we break this trade-off by developing a distribution-invariant privatization (DIP) method that reconciles high statistical accuracy with strict differential privacy. As a result, any downstream statistical or machine learning task yields essentially the same conclusion as if the original data were used. Numerically, under the same strictness of privacy protection, DIP achieves superior statistical accuracy in two simulations and on three real-world benchmarks.
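Below is a conceptual numpy sketch of the distribution-invariance idea only; it is not the DIP mechanism, and the privacy accounting of any such transform (how the reference quantiles are obtained and the privacy budget spent) is precisely what the paper establishes. It merely contrasts naive per-record Laplace perturbation, which distorts the sample distribution, with a rank-preserving map back to reference quantiles from an independent sample.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=2000)
holdout = rng.normal(0.0, 1.0, size=2000)              # independent reference sample

noisy = data + rng.laplace(scale=1.0, size=data.size)  # naive per-record perturbation

# Rank-preserving map: replace each perturbed value by the holdout quantile of its rank.
ranks = noisy.argsort().argsort() / (noisy.size - 1)
restored = np.quantile(holdout, ranks)

for name, x in [("raw", data), ("noisy", noisy), ("restored", restored)]:
    print(f"{name:9s} mean {x.mean():+.3f}  sd {x.std():.3f}")
# The restored sample approximately follows the reference distribution, illustrating
# why a distribution-invariant release need not distort downstream analyses.
```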
Abstract: A recommender system learns to predict user-specific preferences or intentions over many items simultaneously for all users, making personalized recommendations based on a relatively small number of observations. One central issue is how to leverage three-way interactions, referred to as user-item-stage dependencies on a monotonic chain of events, to enhance prediction accuracy. A monotonic chain of events occurs, for instance, in an article sharing dataset, where a ``follow'' action implies a ``like'' action, which in turn implies a ``view'' action. In this article, we develop a multistage recommender system utilizing a two-level monotonic property characterizing a monotonic chain of events for personalized prediction. Specifically, we derive a large-margin classifier based on a nonnegative additive latent factor model in the presence of a high percentage of missing observations, particularly between stages, reducing the number of model parameters for personalized prediction while guaranteeing prediction consistency. On this ground, we derive a regularized cost function to learn user-specific behaviors at different stages, linking decision functions to numerical and categorical covariates to model user-item-stage interactions. Computationally, we derive an algorithm based on blockwise coordinate descent. Theoretically, we show that the two-level monotonic property enhances the accuracy of learning compared to a standard method treating each stage individually and an ordinal method utilizing only one-level monotonicity. Finally, the proposed method compares favorably with existing methods in simulations and on an article sharing dataset.
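A toy numpy sketch (not the paper's model) of why nonnegative, additive stage effects make predictions respect a monotonic chain: if each later stage subtracts a nonnegative increment from the base score, thresholding at a common cutoff automatically yields "follow implies like implies view". The latent factor dimensions and gamma draws below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_users, n_items, rank, n_stages = 50, 40, 3, 3         # stages: view, like, follow

U = rng.gamma(1.0, 1.0, (n_users, rank))
V = rng.gamma(1.0, 1.0, (n_items, rank))
base = U @ V.T                                           # user-item affinity (stage "view")

# Nonnegative per-stage increments; cumulative subtraction enforces monotone scores.
deltas = rng.gamma(1.0, 0.5, (n_stages - 1, n_users, n_items))
scores = np.stack([base, base - deltas[0], base - deltas[0] - deltas[1]])

preds = scores > np.median(base)                         # a common decision threshold
assert np.all(preds[1] <= preds[0]) and np.all(preds[2] <= preds[1])
print("monotone chain respected: every follow is a like, every like is a view")
```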
Abstract: An exciting recent development is the uptake of deep learning in many scientific fields, where the objective is to seek novel scientific insights and discoveries. To interpret a learning outcome, researchers perform hypothesis testing for explainable features to advance scientific domain knowledge. In such a situation, testing for a blackbox learner poses a severe challenge because of intractable models, unknown limiting distributions of parameter estimates, and high computational constraints. In this article, we derive two consistent tests for the feature relevance of a blackbox learner. The first evaluates a loss difference with perturbation on an inference sample, which is independent of an estimation sample used for parameter estimation in model fitting. The second further splits the inference sample into two parts but does not require data perturbation. We also develop their combined versions by aggregating the order statistics of the $p$-values based on repeated sample splitting. To estimate the splitting ratio and the perturbation size, we develop adaptive splitting schemes for suitably controlling the Type I error subject to computational constraints. By deflating the \textit{bias-sd-ratio}, we establish asymptotic null distributions of the test statistics and their consistency in terms of statistical power. Our theoretical power analysis and simulations indicate that the one-split test is more powerful than the two-split test, though the latter is easier to apply for large datasets. Moreover, the combined tests are more stable while compensating for a power loss by repeated sample splitting. Numerically, we demonstrate the utility of the proposed tests on two benchmark examples. Accompanying this paper is our Python library {\tt dnn-inference} (https://dnn-inference.readthedocs.io/en/latest/), which implements the proposed tests.
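In the spirit of the one-split test, here is a simplified sketch: fit the learner with and without the candidate features on an estimation sample, then test whether the loss difference on a held-out inference sample is positive. The learner, loss, and plain normal approximation are illustrative assumptions, and the paper's perturbation scheme and bias-sd-ratio control are omitted; the dnn-inference library implements the actual tests.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
n = 2000
X = rng.normal(size=(n, 10))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

X_est, X_inf, y_est, y_inf = train_test_split(X, y, test_size=0.4, random_state=0)
tested = [0]                                             # features whose relevance is tested

full = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X_est, y_est)
reduced = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(
    np.delete(X_est, tested, axis=1), y_est)

loss_full = (y_inf - full.predict(X_inf)) ** 2
loss_reduced = (y_inf - reduced.predict(np.delete(X_inf, tested, axis=1))) ** 2
diff = loss_reduced - loss_full                          # positive if the features matter

t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))
p_value = 1 - stats.norm.cdf(t_stat)                     # one-sided test of feature relevance
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```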
Abstract: Recommender systems have been widely adopted by the electronic commerce and entertainment industries for individualized prediction and recommendation, which benefit consumers and improve business intelligence. In this article, we propose an innovative method, namely the recommendation engine of multilayers (REM), for tensor recommender systems. The proposed method utilizes the structure of a tensor response to integrate information from multiple modes and creates an additional layer of nested latent factors to accommodate between-subjects dependency. One major advantage is that the proposed method is able to address the "cold-start" issue in the absence of information from new customers, new products, or new contexts. Specifically, it provides more effective recommendations through sub-group information. To achieve scalable computation, we develop a new algorithm for the proposed method, which incorporates a maximum block improvement strategy into the cyclic blockwise coordinate descent algorithm. In theory, we investigate the algorithmic properties of both global and local convergence, along with the asymptotic consistency of the estimated parameters. Finally, the proposed method is applied in simulations and to IRI marketing data with 116 million observations of product sales. Numerical studies demonstrate that the proposed method outperforms existing competitors in the literature.
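As a hedged sketch of the blockwise-coordinate flavor of such algorithms (not REM: the nested latent factors, the maximum block improvement step, and the cold-start machinery are not reproduced), the numpy snippet below fits a rank-R CP model to a partially observed user-by-item-by-context tensor by cycling gradient updates over the three factor blocks.

```python
import numpy as np

rng = np.random.default_rng(7)
I, J, K, R = 30, 25, 8, 4
A0, B0, C0 = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0) + 0.1 * rng.normal(size=(I, J, K))
mask = rng.random((I, J, K)) < 0.3                       # only 30% of entries observed

A, B, C = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))
lr = 1e-3
for it in range(500):
    # Cyclic block updates: each factor block is updated with the others held fixed.
    resid = mask * (np.einsum('ir,jr,kr->ijk', A, B, C) - T)
    A -= lr * np.einsum('ijk,jr,kr->ir', resid, B, C)
    resid = mask * (np.einsum('ir,jr,kr->ijk', A, B, C) - T)
    B -= lr * np.einsum('ijk,ir,kr->jr', resid, A, C)
    resid = mask * (np.einsum('ir,jr,kr->ijk', A, B, C) - T)
    C -= lr * np.einsum('ijk,ir,jr->kr', resid, A, B)

rmse = np.sqrt((mask * (np.einsum('ir,jr,kr->ijk', A, B, C) - T) ** 2).sum() / mask.sum())
print(f"RMSE on observed entries: {rmse:.3f}")
```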