SODA, IP Paris
Abstract: Survival analysis models the time until an event occurs, and accurate probability estimates are crucial for decision-making, particularly in the competing-risks setting where multiple events are possible. While recent work has addressed calibration in standard survival analysis, the competing-risks setting remains under-explored because it is harder: calibration must hold both across event classes and across time horizons. We show that existing calibration measures are not suited to the competing-risks setting and that recent models do not give well-behaved probabilities. To address this, we introduce a dedicated framework with two novel calibration measures that are minimized for oracle estimators (i.e., both measures are proper). We also introduce methods to estimate, test, and correct calibration. Our recalibration methods yield good probabilities while preserving discrimination.
Abstract: Mediation analysis breaks down the causal effect of a treatment on an outcome into an indirect effect, acting through a third group of variables called mediators, and a direct effect, operating through other mechanisms. Mediation analysis is hard because confounders between treatment, mediators, and outcome blur effect estimates in observational studies. Many estimators have been proposed to adjust for those confounders and provide accurate causal estimates. We consider parametric and non-parametric implementations of classical estimators and thoroughly evaluate how well they estimate the direct and indirect effects in causal mediation analysis with binary, continuous, and multi-dimensional mediators. We assess several approaches in a comprehensive benchmark on simulated data. Our results show that advanced statistical approaches such as the multiply robust and the double machine learning estimators perform well in most of the simulated settings and on real data. As an application, we analyze factors known to influence cognitive functions to assess whether the mechanism involves modifications in brain morphology, using the UK Biobank brain imaging cohort. This analysis shows that for several physiological factors, such as hypertension and obesity, a substantial part of the effect is mediated by changes in brain structure. This work provides guidance to the practitioner, from the formulation of a valid causal mediation problem, including the verification of the identification assumptions, to the choice of an adequate estimator.
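To make the direct/indirect decomposition concrete, here is a minimal sketch in the simplest possible case: a linear model with a single mediator and no confounders, where the indirect effect is the product of two regression coefficients. This is only an illustration, not one of the estimators benchmarked in the paper (such as the multiply robust or double machine learning estimators). The function name `linear_mediation_effects` and the simulated coefficients are assumptions chosen for the example.

```python
import numpy as np

def linear_mediation_effects(t, m, y):
    """Product-of-coefficients estimates under a linear, unconfounded model.

    Hypothetical helper for illustration: t is the treatment, m a single
    mediator, y the outcome, all 1-D arrays of the same length.
    """
    ones = np.ones(len(t))
    # Mediator model: m = a0 + a1 * t
    a = np.linalg.lstsq(np.column_stack([ones, t]), m, rcond=None)[0]
    # Outcome model: y = b0 + b1 * t + b2 * m
    b = np.linalg.lstsq(np.column_stack([ones, t, m]), y, rcond=None)[0]
    direct = b[1]           # effect of t on y with the mediator held fixed
    indirect = a[1] * b[2]  # effect of t on y transmitted through m
    return direct, indirect, direct + indirect

# Simulated data (assumed values): true direct effect 1.0,
# true indirect effect 0.5 * 2.0 = 1.0.
rng = np.random.default_rng(0)
n = 20_000
t = rng.integers(0, 2, size=n).astype(float)
m = 0.5 * t + rng.normal(size=n)
y = 1.0 * t + 2.0 * m + rng.normal(size=n)

direct, indirect, total = linear_mediation_effects(t, m, y)
print(f"direct={direct:.2f} indirect={indirect:.2f} total={total:.2f}")
```

With confounders or non-linear relationships, this simple product no longer identifies the effects, which is the situation the estimators compared in the abstract are designed for.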
Abstract: When dealing with right-censored data, where some outcomes are missing due to a limited observation period, survival analysis, also known as time-to-event analysis, focuses on predicting the time until an event of interest occurs. Multiple classes of outcomes lead to a classification variant: predicting the most likely event, a less explored area known as competing risks. Classic competing risks models couple architecture and loss, limiting scalability. To address these issues, we design a strictly proper censoring-adjusted separable scoring rule, allowing optimization on a subset of the data as each observation is evaluated independently. The loss estimates outcome probabilities and enables stochastic optimization for competing risks, which we use for efficient gradient boosting trees. SurvivalBoost not only outperforms 12 state-of-the-art models across several metrics on 4 real-life datasets, both in competing risks and survival settings, but also provides great calibration, the ability to predict at any time horizon, and shorter computation times than existing methods.
Abstract: When data are right-censored, i.e., some outcomes are missing due to a limited period of observation, survival analysis can estimate the "time to event". Multiple classes of outcomes lead to a classification variant: predicting the most likely event, known as competing risks, which has been less studied. To build a loss that estimates outcome probabilities for such settings, we introduce a strictly proper censoring-adjusted separable scoring rule that can be optimized on a subset of the data because each observation is evaluated independently. It enables stochastic optimization for competing risks, which we use to train gradient boosting trees. Compared to 11 state-of-the-art models, this model, MultiIncidence, performs best in estimating the probability of outcomes in survival and competing risks. It can predict at any time horizon and is much faster than existing alternatives.
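For readers unfamiliar with the competing-risks setting, the sketch below estimates the probability of each event type over time from right-censored data using the classical non-parametric Aalen-Johansen estimator. It is not the MultiIncidence or SurvivalBoost model, nor the scoring rule from these abstracts; it only illustrates what "outcome probabilities at any time horizon" means. The function name `cumulative_incidence`, the encoding (0 for censored, 1..K for event types), and the toy data are assumptions made for the example.

```python
import numpy as np

def cumulative_incidence(durations, events, n_event_types):
    """Aalen-Johansen cumulative incidence per event type.

    Assumed encoding: events == 0 means censored, events == k (k >= 1)
    means an event of type k was observed at the given duration.
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    times = np.unique(durations[events > 0])  # sorted distinct event times
    cif = np.zeros((n_event_types, len(times)))
    surv = 1.0  # probability of being event-free just before the current time
    running = np.zeros(n_event_types)
    for j, t in enumerate(times):
        at_risk = np.sum(durations >= t)
        for k in range(1, n_event_types + 1):
            d_k = np.sum((durations == t) & (events == k))
            # Mass moved to event k: still event-free, then event k occurs at t.
            running[k - 1] += surv * d_k / at_risk
        d_any = np.sum((durations == t) & (events > 0))
        surv *= 1.0 - d_any / at_risk
        cif[:, j] = running
    return times, cif

# Toy data: six individuals, two competing event types, two censored.
durations = [1, 2, 2, 3, 4, 5]
events = [1, 0, 2, 1, 0, 2]
times, cif = cumulative_incidence(durations, events, n_event_types=2)
for t, row in zip(times, cif.T):
    print(t, np.round(row, 3))
```

At every time point, the per-event probabilities and the event-free probability sum to one, which is why calibration in this setting has to be assessed jointly across event types and time horizons.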