Abstract:Malware is a fast-growing threat to the modern computing world and existing lines of defense are not efficient enough to address this issue. This is mainly due to the fact that many prevention solutions rely on signature-based detection methods that can easily be circumvented by hackers. Therefore, there is a recurrent need for behavior-based analysis where a suspicious file is ran in a secured environment and its traces are collected to reports for analysis. Previous works have shown some success leveraging Neural Networks and API calls sequences extracted from these execution reports. Recently, Large Language Models and Generative AI have demonstrated impressive capabilities mainly in Natural Language Processing tasks and promising applications in the cybersecurity field for both attackers and defenders. In this paper, we design an Encoder-Only model, based on the Transformers architecture, to detect malicious files, digesting their API call sequences collected by an execution emulation solution. We are also limiting the size of the model architecture and the number of its parameters since it is often considered that Large Language Models may be overkill for specific tasks such as the one we are dealing with hereafter. In addition to achieving decent detection results, this approach has the advantage of reducing our carbon footprint by limiting training and inference times and facilitating technical operations with less hardware requirements. We also carry out some analysis of our results and highlight the limits and possible improvements when using Transformers to analyze malicious files.
Abstract:Despite the promising results of machine learning models in malware detection, they face the problem of concept drift due to malware constant evolution. This leads to a decline in performance over time, as the data distribution of the new files differs from the training one, requiring regular model update. In this work, we propose a model-agnostic protocol to improve a baseline neural network to handle with the drift problem. We show the importance of feature reduction and training with the most recent validation set possible, and propose a loss function named Drift-Resilient Binary Cross-Entropy, an improvement to the classical Binary Cross-Entropy more effective against drift. We train our model on the EMBER dataset (2018) and evaluate it on a dataset of recent malicious files, collected between 2020 and 2023. Our improved model shows promising results, detecting 15.2% more malware than a baseline model.
Abstract:Cybercrime is one of the major digital threats of this century. In particular, ransomware attacks have significantly increased, resulting in global damage costs of tens of billion dollars. In this paper, we train and test different Machine Learning and Deep Learning models for malware detection, malware classification and ransomware detection. We introduce a novel and flexible ransomware detection model that combines two optimized models. Our detection results on a limited dataset demonstrate good accuracy and F1 scores.
Abstract:In addition to signature-based and heuristics-based detection techniques, machine learning (ML) is widely used to generalize to new, never-before-seen malicious software (malware). However, it has been demonstrated that ML models can be fooled by tricking the classifier into returning the incorrect label. These studies, for instance, usually rely on a prediction score that is fragile to gradient-based attacks. In the context of a more realistic situation where an attacker has very little information about the outputs of a malware detection engine, modest evasion rates are achieved. In this paper, we propose a method using reinforcement learning with DQN and REINFORCE algorithms to challenge two state-of-the-art ML-based detection engines (MalConv \& EMBER) and a commercial AV classified by Gartner as a leader AV. Our method combines several actions, modifying a Windows portable execution (PE) file without breaking its functionalities. Our method also identifies which actions perform better and compiles a detailed vulnerability report to help mitigate the evasion. We demonstrate that REINFORCE achieves very good evasion rates even on a commercial AV with limited available information.
Abstract:Malware detection and analysis are active research subjects in cybersecurity over the last years. Indeed, the development of obfuscation techniques, as packing, for example, requires special attention to detect recent variants of malware. The usual detection methods do not necessarily provide tools to interpret the results. Therefore, we propose a model based on the transformation of binary files into grayscale image, which achieves an accuracy rate of 88%. Furthermore, the proposed model can determine if a sample is packed or encrypted with a precision of 85%. It allows us to analyze results and act appropriately. Also, by applying attention mechanisms on detection models, we have the possibility to identify which part of the files looks suspicious. This kind of tool should be very useful for data analysts, it compensates for the lack of interpretability of the common detection models, and it can help to understand why some malicious files are undetected.