Robust Risk Prediction from Noisy Data
Nathan Stromberg
Abstract:
Risk prediction models in healthcare are often trained in a supervised fashion by learning a mapping from a set of clinical variables to an outcome of interest. For example, self-reported symptom data are combined with disease diagnoses to train predictive models that characterize disease risk from symptoms; in practice, however, these data are often noisy. Arguably the most common approach to characterizing risk in clinical applications is the logistic model, owing to its interpretability. We present a method for robust risk prediction using a hyperparameterized loss function in the logistic model that preserves not only the predicted probabilities (a proxy for a composite risk score) but also the odds ratios of the model (a proxy for covariate-level risk) under unknown amounts of label noise. We demonstrate the efficacy of our method on a synthetic Gaussian mixture dataset and a large COVID-19 self-reported survey dataset.
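The abstract does not specify the paper's hyperparameterized loss, so the following is only an illustrative sketch of the general setup it describes: logistic regression trained on labels with symmetric flip noise, comparing the ordinary log-loss against a stand-in robust loss (here the generalized cross-entropy of Zhang & Sabuncu, with hyperparameter `q`, chosen purely for illustration). Under label noise, maximum-likelihood coefficients (and hence odds ratios) attenuate toward zero; a noise-robust loss is meant to mitigate this.

```python
# Hypothetical sketch only: the paper's actual loss is not given in the
# abstract. We use the generalized cross-entropy L_q(p) = (1 - p_y^q) / q
# as a stand-in robust loss; q -> 0 recovers the ordinary log-loss.
import numpy as np

rng = np.random.default_rng(0)

def make_noisy_mixture(n=2000, flip=0.2):
    """Two-component Gaussian mixture with symmetrically flipped labels."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, 2)) + np.where(y[:, None] == 1, 1.5, -1.5)
    noisy = y.copy()
    mask = rng.random(n) < flip
    noisy[mask] = 1 - noisy[mask]  # flip a fraction of labels at random
    return X, y, noisy

def fit_logistic(X, y, q=0.0, lr=0.1, steps=2000):
    """Full-batch gradient descent on the GCE loss (q=0: plain log-loss)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    s = 2 * y - 1  # labels as +/-1, sets the sign of dp_y/dz
    for _ in range(steps):
        z = X @ w + b
        p = 1.0 / (1.0 + np.exp(-z))
        p_y = np.clip(np.where(y == 1, p, 1 - p), 1e-12, 1.0)
        if q > 0:
            # dL_q/dz = -p_y^(q-1) * s * p * (1 - p)
            g = -(p_y ** (q - 1)) * s * p * (1 - p)
        else:
            g = p - y  # standard cross-entropy gradient
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

X, clean, noisy = make_noisy_mixture()
w_ce, b_ce = fit_logistic(X, noisy, q=0.0)   # ordinary logistic regression
w_rob, b_rob = fit_logistic(X, noisy, q=0.7) # stand-in robust loss
print("odds ratios (log-loss):", np.exp(w_ce))
print("odds ratios (robust):  ", np.exp(w_rob))
```

The comparison of `np.exp(w)` across the two fits is the covariate-level view the abstract alludes to: noisy labels shrink the log-loss odds ratios toward 1, while a noise-robust objective is intended to keep them closer to the clean-data values.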