Abstract: Malware is a fast-growing threat to the modern computing world, and existing lines of defense are not effective enough to address it. This is mainly because many prevention solutions rely on signature-based detection methods that hackers can easily circumvent. There is therefore a recurrent need for behavior-based analysis, where a suspicious file is run in a secured environment and its execution traces are collected into reports for analysis. Previous works have shown some success leveraging Neural Networks on API call sequences extracted from these execution reports. Recently, Large Language Models and Generative AI have demonstrated impressive capabilities, mainly in Natural Language Processing tasks, along with promising applications in cybersecurity for both attackers and defenders. In this paper, we design an Encoder-Only model, based on the Transformer architecture, to detect malicious files by digesting their API call sequences collected by an execution emulation solution. We also limit the size of the model architecture and its number of parameters, since Large Language Models are often considered overkill for specific tasks such as the one addressed here. In addition to achieving decent detection results, this approach reduces our carbon footprint by limiting training and inference times, and facilitates technical operations with lower hardware requirements. We also analyze our results and highlight the limits of and possible improvements to using Transformers to analyze malicious files.
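As an illustration of the kind of architecture described above, the following is a minimal sketch of an encoder-only Transformer classifier over API call sequences; the vocabulary size, embedding dimension, layer count, and mean-pooling head are assumptions for the sketch, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class APICallEncoderClassifier(nn.Module):
    """Small encoder-only Transformer that classifies an API call sequence
    as benign (0) or malicious (1). Hyperparameters are illustrative only."""
    def __init__(self, vocab_size=1024, d_model=128, n_heads=4,
                 n_layers=2, max_len=512, n_classes=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, api_ids, pad_mask=None):
        # api_ids: (batch, seq_len) integer IDs of API calls from the report
        positions = torch.arange(api_ids.size(1), device=api_ids.device)
        x = self.token_emb(api_ids) + self.pos_emb(positions)
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over the sequence, then classify
        return self.classifier(x.mean(dim=1))

# Example: a batch of two (already tokenized) API call ID sequences
model = APICallEncoderClassifier()
batch = torch.randint(1, 1024, (2, 512))
logits = model(batch)  # shape (2, 2)
```

Keeping the encoder to a couple of layers and a small embedding dimension, as in this sketch, is consistent with the reduced training and inference footprint discussed above.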
Abstract: Existing research on malware detection focuses almost exclusively on the detection rate. However, in some cases it is also important to understand the results of our algorithm, or to obtain more information, such as where in the file an analyst should investigate. To this end, we propose a new model to analyze Portable Executable files. Our method consists in splitting the files into their different sections, transforming each section into an image, and training convolutional neural networks to treat each identified section specifically. We then use the scores returned by the CNNs to compute a final detection score, using models that allow us to better analyze the importance of each section in the final score.
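A minimal sketch of such a per-section pipeline could look as follows, assuming sections are extracted with the pefile library; the `section_cnns` objects and the `fusion_model` (e.g. a logistic regression whose coefficients expose each section's importance) are hypothetical stand-ins for the models described above.

```python
import numpy as np
import pefile  # assumption: PE sections are extracted with the pefile library

def section_to_image(raw_bytes, width=64):
    """Reshape a section's raw bytes into a 2-D grayscale image (one byte per pixel)."""
    data = np.frombuffer(raw_bytes, dtype=np.uint8)
    pad = (-len(data)) % width
    data = np.concatenate([data, np.zeros(pad, dtype=np.uint8)])
    return data.reshape(-1, width)

def score_file(path, section_cnns, fusion_model):
    """Score each known PE section with its dedicated CNN, then fuse the
    per-section scores into a single detection score."""
    pe = pefile.PE(path)
    scores = {}
    for sec in pe.sections:
        name = sec.Name.rstrip(b"\x00").decode(errors="ignore")
        if name in section_cnns:                      # e.g. ".text", ".data", ".rsrc"
            img = section_to_image(sec.get_data())
            scores[name] = section_cnns[name].predict(img)  # hypothetical per-section CNN
    # The fusion model (e.g. a logistic regression) weights each section's score,
    # which also exposes how much each section contributes to the final decision.
    features = [scores.get(name, 0.0) for name in sorted(section_cnns)]
    return fusion_model.predict_proba([features])[0][1]
```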
Abstract: Malware detection is an important topic in current cybersecurity, and Machine Learning appears to be one of the main solutions considered, even though problems generalizing to new malware remain. With the aim of exploring the potential of quantum machine learning in this domain, our previous work showed that quantum neural networks do not perform well on image-based malware detection when using only a few qubits. To enhance the performance of our quantum algorithms for image-based malware detection, without increasing the required number of qubits, we implement a new preprocessing of our dataset using a grayscale method, and we couple it with a model composed of five distributed quantum convolutional networks and a scoring function. We obtain an improvement of around 20\% in our results, both in test accuracy and F1-score.
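The overall structure could be sketched as below, starting from the grayscale byte image of an executable, splitting it into five patches (one per quantum network), and combining the scores with a simple weighted-average scoring function; the quantum convolutional networks themselves are represented here by opaque scoring callables, and the uniform weighting is an illustrative placeholder.

```python
import numpy as np

def split_into_patches(gray_image, n_patches=5):
    """Split a grayscale byte image into horizontal bands,
    one band per quantum convolutional network."""
    return np.array_split(gray_image, n_patches, axis=0)

def detection_score(gray_image, quantum_nets, weights=None):
    """Score each patch with its quantum network, then combine the five
    scores with a weighted-average scoring function (placeholder)."""
    patches = split_into_patches(gray_image, n_patches=len(quantum_nets))
    scores = np.array([net(p) for net, p in zip(quantum_nets, patches)])
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    return float(np.dot(weights, scores))  # e.g. > 0.5 -> flag as malicious

# Example with a dummy grayscale image and stand-ins for the five quantum networks
image = np.random.randint(0, 256, size=(100, 64), dtype=np.uint8)
dummy_nets = [lambda patch: patch.mean() / 255.0 for _ in range(5)]
print(detection_score(image, dummy_nets))
```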
Abstract: In the context of malicious software detection, machine learning (ML) is widely used to generalize to new malware. However, it has been demonstrated that ML models can be fooled and may have trouble generalizing to malware that has never been seen. We investigate the possible benefits of quantum algorithms for classification tasks. We implement two Quantum Machine Learning models and compare them to classical models on the classification of a dataset composed of malicious and benign executable files. We optimize our algorithms based on methods found in the literature, and we analyze our results in an exploratory way to identify the most promising directions to explore in the future.
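For illustration, one of the quantum models could resemble the small variational quantum classifier sketched below with PennyLane; the angle embedding, entangling ansatz, qubit count, and feature dimension are assumptions made for this sketch rather than the paper's actual circuits.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4  # assumption: 4 preprocessed file features, one qubit per feature
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(weights, features):
    # Encode the (pre-scaled) file features as single-qubit rotation angles
    qml.AngleEmbedding(features, wires=range(n_qubits))
    # Trainable entangling layers acting as the variational classifier
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

def predict(weights, features):
    """Map the expectation value in [-1, 1] to a malicious (1) / benign (0) label."""
    return 1 if circuit(weights, features) < 0.0 else 0

# Random initial weights with the shape expected by StronglyEntanglingLayers
shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
weights = 0.01 * np.random.random(size=shape)
sample = np.array([0.1, 0.5, 0.3, 0.9])  # toy feature vector for one file
print(predict(weights, sample))
```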