Abstract: Deep learning fostered a leap ahead in automated melanoma screening in the last two years. Those models, however, are expensive to train and difficult to parameterize. Objective: We investigate methodological issues in designing and evaluating deep learning models for melanoma detection. We explore ten choices faced by researchers: use of transfer learning, model architecture, training dataset, image resolution, type of data augmentation, input normalization, use of segmentation, duration of training, additional use of an SVM, and test-time data augmentation. Methods: We perform two full factorial experiments, over five different test datasets, resulting in 2560 exhaustive trials in our main experiment and 1280 trials in our assessment of transfer learning. We analyze both with multi-way ANOVA. We use the exhaustive trials to simulate sequential decisions and ensembles, with and without the use of privileged information from the test set. Results (main experiment): The amount of training data has a disproportionate influence, explaining almost half the variation in performance. Of the other factors, test-time data augmentation and input resolution are the most influential. Deeper models, when combined with extra data, also help. (Transfer experiment): Transfer learning is critical; its absence brings huge performance penalties. (Simulations): Ensembles of models are the best option to provide reliable results with limited resources, without using privileged information or sacrificing methodological rigor. Conclusions and Significance: Advancing research on automated melanoma screening requires curating larger public datasets. Indirect use of privileged information from the test set to design the models is a subtle but frequent methodological mistake that leads to overoptimistic results. Ensembles of models are a cost-effective alternative to expensive full-factorial designs and to unstable sequential designs.
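For concreteness, the sketch below shows the shape of the full-factorial protocol described above: every combination of factor levels is trained and evaluated, and a multi-way ANOVA apportions the variation in performance among the factors. The factor names are illustrative (three of the ten choices), and run_trial is a hypothetical stand-in for an actual train/evaluate cycle.

    import itertools
    import random

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    def run_trial(**config):
        # Hypothetical stand-in for training and evaluating one model;
        # returns a simulated AUC so the sketch runs end to end.
        return 0.75 + 0.05 * config["transfer"] + random.gauss(0, 0.01)

    # Three of the ten design choices, each with two levels, crossed exhaustively.
    factors = {
        "transfer":   [0, 1],   # use of transfer learning
        "resolution": [0, 1],   # low vs. high input resolution
        "test_aug":   [0, 1],   # test-time data augmentation
    }

    rows = []
    for combo in itertools.product(*factors.values()):
        config = dict(zip(factors.keys(), combo))
        config["auc"] = run_trial(**config)
        rows.append(config)

    df = pd.DataFrame(rows)
    # Multi-way ANOVA: how much of the variation each factor explains.
    model = ols("auc ~ C(transfer) + C(resolution) + C(test_aug)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))

With all ten factors and five test datasets, the same loop yields the 2560 exhaustive trials of the main experiment.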
Abstract: Knowledge transfer impacts the performance of deep learning, the state of the art for image classification tasks, including automated melanoma screening. Deep learning's greed for large amounts of training data poses a challenge for medical tasks, which we can alleviate by recycling knowledge from models trained on different tasks, in a scheme called transfer learning. Although much of the best art on automated melanoma screening employs some form of transfer learning, a systematic evaluation was missing. Here we investigate the presence of transfer, the task from which the transfer is sourced, and the application of fine-tuning (i.e., retraining the deep learning model after transfer). We also test the impact of picking deeper (and more expensive) models. Our results favor deeper models, pre-trained on ImageNet, with fine-tuning, which reach AUCs of 80.7% and 84.5% on the two skin-lesion datasets evaluated.
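A minimal sketch of the winning recipe, transfer from ImageNet followed by fine-tuning, appears below. It uses torchvision's ResNet-50 as one example of a "deeper" model; the architecture and hyperparameter choices here are illustrative, not the exact configuration evaluated.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Transfer: start from weights learned on ImageNet.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 2)  # melanoma vs. benign

    # Fine-tuning: all layers stay trainable, at a small learning rate,
    # so the transferred weights are adjusted rather than discarded.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    # Without fine-tuning, one would instead freeze the transferred layers
    # and train only the new classification head:
    # for p in model.parameters():
    #     p.requires_grad = False
    # model.fc = nn.Linear(model.fc.in_features, 2)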
Abstract: In this paper we survey, analyze, and criticize current art on automated melanoma screening, reimplementing a baseline technique and proposing two novel ones. Melanoma, although highly curable when detected early, remains one of the most dangerous types of cancer, due to delayed diagnosis and treatment. Its incidence is soaring, much faster than the number of trained professionals able to diagnose it. Automated screening appears as an alternative to make the most of those professionals, focusing their time on the patients at risk while safely discharging the others. However, the potential of automated melanoma diagnosis is currently unfulfilled, due to the emphasis of the current literature on outdated computer vision models. Even more problematic is the irreproducibility of current art. We show how streamlined pipelines based upon current computer vision outperform conventional models: a model based on an advanced bag of words reaches an AUC of 84.6%, and a model based on deep neural networks reaches 89.3%, while the baseline (a classical bag of words) stays at 81.2%. We also initiate a dialog to improve reproducibility in our community.
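The sketch below shows the general shape of the pipelines compared above, image descriptors fed to a linear SVM and scored by AUC, under loud assumptions: the random feature matrix is a stand-in for actual lesion descriptors (bag-of-words histograms or deep-network activations), and the labels are toy data.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4096))   # stand-in for extracted lesion descriptors
    y = rng.integers(0, 2, size=200)   # 1 = melanoma, 0 = benign (toy labels)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    svm = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)

    # AUC, the metric reported throughout: area under the ROC curve of the
    # classifier's decision scores.
    print("AUC =", roc_auc_score(y_te, svm.decision_function(X_te)))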