Abstract:This study builds upon our previous work by introducing a refined Inductive Conformal Martingale (ICM) approach for addressing Concept Drift (CD). Specifically, we enhance our previously proposed CAUTIOUS betting function to incorporate multiple density estimators for improving detection ability. We also combine this betting function with two base estimators that have not been previously utilized within the ICM framework: the Interpolated Histogram and Nearest Neighbor Density Estimators. We assess these extensions using both a single ICM and an ensemble of ICMs. For the latter, we conduct a comprehensive experimental investigation into the influence of the ensemble size on prediction accuracy and the number of available predictions. Our experimental results on four benchmark datasets demonstrate that the proposed approach surpasses our previous methodology in terms of performance while matching or in many cases exceeding that of three contemporary state-of-the-art techniques.
Abstract:The impressive growth of smartphone devices in combination with the rising ubiquity of using mobile platforms for sensitive applications such as Internet banking, have triggered a rapid increase in mobile malware. In recent literature, many studies examine Machine Learning techniques, as the most promising approach for mobile malware detection, without however quantifying the uncertainty involved in their detections. In this paper, we address this problem by proposing a machine learning dynamic analysis approach that provides provably valid confidence guarantees in each malware detection. Moreover the particular guarantees hold for both the malicious and benign classes independently and are unaffected by any bias in the data. The proposed approach is based on a novel machine learning framework, called Conformal Prediction, combined with a random forests classifier. We examine its performance on a large-scale dataset collected by installing 1866 malicious and 4816 benign applications on a real android device. We make this collection of dynamic analysis data available to the research community. The obtained experimental results demonstrate the empirical validity, usefulness and unbiased nature of the outputs produced by the proposed approach.