Abstract:A frequent problem in statistical science is how to properly handle missing data in matched paired observations. There is a large body of literature coping with the univariate case. Yet, the ongoing technological progress in measuring biological systems raises the need for addressing more complex data, e.g., graphs, strings and probability distributions, among others. In order to fill this gap, this paper proposes new estimators of the maximum mean discrepancy (MMD) to handle complex matched pairs with missing data. These estimators can detect differences in data distributions under different missingness mechanisms. The validity of this approach is proven and further studied in an extensive simulation study, and results of statistical consistency are provided. Data from continuous glucose monitoring in a longitudinal population-based diabetes study are used to illustrate the application of this approach. By employing the new distributional representations together with cluster analysis, new clinical criteria on how glucose changes vary at the distributional level over five years can be explored.
Abstract:AEGIS study possesses unique information on longitudinal changes in circulating glucose through continuous glucose monitoring technology (CGM). However, as usual in longitudinal medical studies, there is a significant amount of missing data in the outcome variables. For example, 40 percent of glycosylated hemoglobin (A1C) biomarker data are missing five years ahead. With the purpose to reduce the impact of this issue, this article proposes a new data analysis framework based on learning in reproducing kernel Hilbert spaces (RKHS) with missing responses that allows to capture non-linear relations between variable studies in different supervised modeling tasks. First, we extend the Hilbert-Schmidt dependence measure to test statistical independence in this context introducing a new bootstrap procedure, for which we prove consistency. Next, we adapt or use existing models of variable selection, regression, and conformal inference to obtain new clinical findings about glucose changes five years ahead with the AEGIS data. The most relevant findings are summarized below: i) We identify new factors associated with long-term glucose evolution; ii) We show the clinical sensibility of CGM data to detect changes in glucose metabolism; iii) We can improve clinical interventions based on our algorithms' expected glucose changes according to patients' baseline characteristics.