Abstract: Sleep disorder diagnosis relies on the analysis of polysomnography (PSG) records. Sleep stages are systematically determined as a preliminary step of this examination. In practice, sleep stage classification relies on the visual inspection of 30-second epochs of polysomnography signals. Numerous automatic approaches have been developed to replace this tedious and expensive task. Although these methods have demonstrated better performance than human sleep experts on specific datasets, they remain largely unused in sleep clinics. The main reason is that each sleep clinic uses a specific PSG montage that most automatic approaches cannot handle out of the box. Moreover, even when the PSG montage is compatible, published studies have shown that automatic approaches perform poorly on unseen data with different demographics. To address these issues, we introduce RobustSleepNet, a deep learning model for automatic sleep stage classification able to handle arbitrary PSG montages. We trained and evaluated this model in a leave-one-dataset-out fashion on a large corpus of 8 heterogeneous sleep staging datasets to make it robust to demographic changes. When evaluated on an unseen dataset, RobustSleepNet reaches 97% of the F1 of a model trained specifically on that dataset. We then show that finetuning RobustSleepNet on part of the unseen dataset increases the F1 by 2% compared to a model trained specifically for that dataset. Hence, RobustSleepNet unlocks the possibility of performing high-quality out-of-the-box automatic sleep staging with any clinical setup. It can also be finetuned to reach a state-of-the-art level of performance on a specific population.
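To make the leave-one-dataset-out evaluation scheme concrete, below is a minimal, self-contained Python sketch. The synthetic datasets, feature shapes, and logistic-regression stand-in are illustrative assumptions, not the paper's actual PSG pipeline or the RobustSleepNet architecture; only the hold-one-dataset-out loop itself reflects the described evaluation.

```python
# Sketch of leave-one-dataset-out (LODO) evaluation on synthetic data.
# A logistic regression stands in for the actual deep learning model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
N_STAGES = 5  # W, N1, N2, N3, REM

# Stand-ins for the heterogeneous corpora: (epoch features, stage labels)
datasets = {
    name: (rng.normal(size=(200, 16)), rng.integers(0, N_STAGES, size=200))
    for name in ["ds_a", "ds_b", "ds_c"]
}

for held_out in datasets:
    # Train on every dataset except the held-out one
    X_train = np.concatenate([X for n, (X, y) in datasets.items() if n != held_out])
    y_train = np.concatenate([y for n, (X, y) in datasets.items() if n != held_out])
    X_test, y_test = datasets[held_out]
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Macro-F1 over the five sleep stages on the unseen dataset
    macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
    print(f"held out {held_out}: macro-F1 = {macro_f1:.3f}")
```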
Abstract: Sleep stage classification constitutes an important element of sleep disorder diagnosis. It relies on the visual inspection of polysomnography records by trained sleep technologists. Automated approaches have been designed to alleviate this resource-intensive task. However, such approaches are usually compared to the annotations of a single human scorer, despite an inter-rater agreement of only about 85%. The present study introduces two publicly available datasets: DOD-H, including 25 healthy volunteers, and DOD-O, including 55 patients suffering from obstructive sleep apnea (OSA). Both datasets were scored by 5 sleep technologists from different sleep centers. We developed a framework to compare automated approaches to a consensus of multiple human scorers. Using this framework, we benchmarked and compared the main approaches from the literature. We also developed and benchmarked a new deep learning method, SimpleSleepNet, inspired by the current state of the art. We demonstrated that many methods can reach human-level performance on both datasets. SimpleSleepNet achieved an F1 of 89.9% vs 86.8% on average for human scorers on DOD-H, and an F1 of 88.3% vs 84.8% on DOD-O. Our study highlights that state-of-the-art automated sleep staging outperforms human scorers for both healthy volunteers and patients suffering from OSA. Consideration could therefore be given to using automated approaches in the clinical setting.
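As an illustration of the consensus-based comparison framework, here is a minimal Python sketch: each human scorer is evaluated against a majority vote of the remaining scorers (so their own annotation does not inflate the reference), and a model against the majority vote of all scorers. The majority vote and its tie-breaking rule are simplifying assumptions; the paper's actual consensus construction may differ.

```python
# Sketch of the multi-scorer consensus comparison, assuming 30-second
# epochs labeled with 5 stages (W, N1, N2, N3, REM) encoded as 0..4.
import numpy as np
from sklearn.metrics import f1_score

N_STAGES = 5

def majority_consensus(scorings):
    """Per-epoch majority vote over a (n_scorers, n_epochs) label array."""
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=N_STAGES), 0, scorings
    )  # shape: (N_STAGES, n_epochs)
    return counts.argmax(axis=0)  # ties broken toward the lowest stage index

rng = np.random.default_rng(0)
scorings = rng.integers(0, N_STAGES, size=(5, 960))   # 5 scorers, one night
model_pred = rng.integers(0, N_STAGES, size=960)      # stand-in model output

# Each scorer vs the consensus of the other scorers
human_f1s = []
for i in range(scorings.shape[0]):
    reference = majority_consensus(np.delete(scorings, i, axis=0))
    human_f1s.append(f1_score(reference, scorings[i], average="macro"))

# Model vs the consensus of all scorers
model_f1 = f1_score(majority_consensus(scorings), model_pred, average="macro")
print(f"avg human F1: {np.mean(human_f1s):.3f}, model F1: {model_f1:.3f}")
```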