Abstract:Background: Clinical management decisions for acutely decompensated CHF patients are often based on grades of pulmonary edema severity, rather than its mere absence or presence. The grading of pulmonary edema on chest radiographs is based on well-known radiologic findings. Purpose: We develop a clinical machine learning task to grade pulmonary edema severity and release both the underlying data and code to serve as a benchmark for future algorithmic developments in machine vision. Materials and Methods: We collected 369,071 chest radiographs and their associated radiology reports from 64,581 patients from the MIMIC-CXR chest radiograph dataset. We extracted pulmonary edema severity labels from the associated radiology reports as 4 ordinal levels: no edema (0), vascular congestion (1), interstitial edema (2), and alveolar edema (3). We developed machine learning models using two standard approaches: 1) a semi-supervised model using a variational autoencoder and 2) a pre-trained supervised learning model using a dense neural network. Results: We measured the area under the receiver operating characteristic curve (AUROC) from the semi-supervised model and the pre-trained model. AUROC for differentiating alveolar edema from no edema was 0.99 and 0.87 (semi-supervised and pre-trained models). Performance of the algorithm was inversely related to the difficulty in categorizing milder states of pulmonary edema: 2 vs 0 (0.88, 0.81), 1 vs 0 (0.79, 0.66), 3 vs 1 (0.93, 0.82), 2 vs 1 (0.69, 0.73), 3 vs 2 (0.88, 0.63). Conclusion: Accurate grading of pulmonary edema on chest radiographs is a clinically important task. Application of state-of-the-art machine learning techniques can produce a novel quantitative imaging biomarker from one of the oldest and most widely available imaging modalities.