Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tony Quertier

Semantic Preprocessing for LLM-based Malware Analysis

Jun 13, 2025

Benjamin Marais, Tony Quertier, Grégoire Barrue

Abstract:In a context of malware analysis, numerous approaches rely on Artificial Intelligence to handle a large volume of data. However, these techniques focus on data view (images, sequences) and not on an expert's view. Noticing this issue, we propose a preprocessing that focuses on expert knowledge to improve malware semantic analysis and result interpretability. We propose a new preprocessing method which creates JSON reports for Portable Executable files. These reports gather features from both static and behavioral analysis, and incorporate packer signature detection, MITRE ATT\&CK and Malware Behavior Catalog (MBC) knowledge. The purpose of this preprocessing is to gather a semantic representation of binary files, understandable by malware analysts, and that can enhance AI models' explainability for malicious files analysis. Using this preprocessing to train a Large Language Model for Malware classification, we achieve a weighted-average F1-score of 0.94 on a complex dataset, representative of market reality.

Via

Access Paper or Ask Questions

Benchmarking data encoding methods in Quantum Machine Learning

May 20, 2025

Orlane Zang, Grégoire Barrué, Tony Quertier

Abstract:Data encoding plays a fundamental and distinctive role in Quantum Machine Learning (QML). While classical approaches process data directly as vectors, QML may require transforming classical data into quantum states through encoding circuits, known as quantum feature maps or quantum embeddings. This step leverages the inherently high-dimensional and non-linear nature of Hilbert space, enabling more efficient data separation in complex feature spaces that may be inaccessible to classical methods. This encoding part significantly affects the performance of the QML model, so it is important to choose the right encoding method for the dataset to be encoded. However, this choice is generally arbitrary, since there is no "universal" rule for knowing which encoding to choose based on a specific set of data. There are currently a variety of encoding methods using different quantum logic gates. We studied the most commonly used types of encoding methods and benchmarked them using different datasets.

* 30 pages, 8 figures

Via

Access Paper or Ask Questions

A Lean Transformer Model for Dynamic Malware Analysis and Detection

Aug 05, 2024

Tony Quertier, Benjamin Marais, Grégoire Barrué, Stéphane Morucci, Sévan Azé, Sébastien Salladin

Figure 1 for A Lean Transformer Model for Dynamic Malware Analysis and Detection

Figure 2 for A Lean Transformer Model for Dynamic Malware Analysis and Detection

Figure 3 for A Lean Transformer Model for Dynamic Malware Analysis and Detection

Figure 4 for A Lean Transformer Model for Dynamic Malware Analysis and Detection

Abstract:Malware is a fast-growing threat to the modern computing world and existing lines of defense are not efficient enough to address this issue. This is mainly due to the fact that many prevention solutions rely on signature-based detection methods that can easily be circumvented by hackers. Therefore, there is a recurrent need for behavior-based analysis where a suspicious file is ran in a secured environment and its traces are collected to reports for analysis. Previous works have shown some success leveraging Neural Networks and API calls sequences extracted from these execution reports. Recently, Large Language Models and Generative AI have demonstrated impressive capabilities mainly in Natural Language Processing tasks and promising applications in the cybersecurity field for both attackers and defenders. In this paper, we design an Encoder-Only model, based on the Transformers architecture, to detect malicious files, digesting their API call sequences collected by an execution emulation solution. We are also limiting the size of the model architecture and the number of its parameters since it is often considered that Large Language Models may be overkill for specific tasks such as the one we are dealing with hereafter. In addition to achieving decent detection results, this approach has the advantage of reducing our carbon footprint by limiting training and inference times and facilitating technical operations with less hardware requirements. We also carry out some analysis of our results and highlight the limits and possible improvements when using Transformers to analyze malicious files.

Via

Access Paper or Ask Questions

Use of Multi-CNNs for Section Analysis in Static Malware Detection

Feb 06, 2024

Tony Quertier, Grégoire Barrué

Abstract:Existing research on malware detection focuses almost exclusively on the detection rate. However, in some cases, it is also important to understand the results of our algorithm, or to obtain more information, such as where to investigate in the file for an analyst. In this aim, we propose a new model to analyze Portable Executable files. Our method consists in splitting the files in different sections, then transform each section into an image, in order to train convolutional neural networks to treat specifically each identified section. Then we use all these scores returned by CNNs to compute a final detection score, using models that enable us to improve our analysis of the importance of each section in the final score.

* arXiv admin note: text overlap with arXiv:2312.12161

Via

Access Paper or Ask Questions

Towards an in-depth detection of malware using distributed QCNN

Dec 19, 2023

Tony Quertier, Grégoire Barrué

Abstract:Malware detection is an important topic of current cybersecurity, and Machine Learning appears to be one of the main considered solutions even if certain problems to generalize to new malware remain. In the aim of exploring the potential of quantum machine learning on this domain, our previous work showed that quantum neural networks do not perform well on image-based malware detection when using a few qubits. In order to enhance the performances of our quantum algorithms for malware detection using images, without increasing the resources needed in terms of qubits, we implement a new preprocessing of our dataset using Grayscale method, and we couple it with a model composed of five distributed quantum convolutional networks and a scoring function. We get an increase of around 20 \% of our results, both on the accuracy of the test and its F1-score.

Via

Access Paper or Ask Questions

Quantum Machine Learning for Malware Classification

May 22, 2023

Grégoire Barrué, Tony Quertier

Abstract:In a context of malicious software detection, machine learning (ML) is widely used to generalize to new malware. However, it has been demonstrated that ML models can be fooled or may have generalization problems on malware that has never been seen. We investigate the possible benefits of quantum algorithms for classification tasks. We implement two models of Quantum Machine Learning algorithms, and we compare them to classical models for the classification of a dataset composed of malicious and benign executable files. We try to optimize our algorithms based on methods found in the literature, and analyze our results in an exploratory way, to identify the most interesting directions to explore for the future.

Via

Access Paper or Ask Questions

Malware and Ransomware Detection Models

Jul 05, 2022

Benjamin Marais, Tony Quertier, Stéphane Morucci

Figure 1 for Malware and Ransomware Detection Models

Figure 2 for Malware and Ransomware Detection Models

Figure 3 for Malware and Ransomware Detection Models

Figure 4 for Malware and Ransomware Detection Models

Abstract:Cybercrime is one of the major digital threats of this century. In particular, ransomware attacks have significantly increased, resulting in global damage costs of tens of billion dollars. In this paper, we train and test different Machine Learning and Deep Learning models for malware detection, malware classification and ransomware detection. We introduce a novel and flexible ransomware detection model that combines two optimized models. Our detection results on a limited dataset demonstrate good accuracy and F1 scores.

Via

Access Paper or Ask Questions

MERLIN -- Malware Evasion with Reinforcement LearnINg

Mar 30, 2022

Tony Quertier, Benjamin Marais, Stéphane Morucci, Bertrand Fournel

Figure 1 for MERLIN -- Malware Evasion with Reinforcement LearnINg

Figure 2 for MERLIN -- Malware Evasion with Reinforcement LearnINg

Figure 3 for MERLIN -- Malware Evasion with Reinforcement LearnINg

Figure 4 for MERLIN -- Malware Evasion with Reinforcement LearnINg

Abstract:In addition to signature-based and heuristics-based detection techniques, machine learning (ML) is widely used to generalize to new, never-before-seen malicious software (malware). However, it has been demonstrated that ML models can be fooled by tricking the classifier into returning the incorrect label. These studies, for instance, usually rely on a prediction score that is fragile to gradient-based attacks. In the context of a more realistic situation where an attacker has very little information about the outputs of a malware detection engine, modest evasion rates are achieved. In this paper, we propose a method using reinforcement learning with DQN and REINFORCE algorithms to challenge two state-of-the-art ML-based detection engines (MalConv \& EMBER) and a commercial AV classified by Gartner as a leader AV. Our method combines several actions, modifying a Windows portable execution (PE) file without breaking its functionalities. Our method also identifies which actions perform better and compiles a detailed vulnerability report to help mitigate the evasion. We demonstrate that REINFORCE achieves very good evasion rates even on a commercial AV with limited available information.

Via

Access Paper or Ask Questions

Malware Analysis with Artificial Intelligence and a Particular Attention on Results Interpretability

Jul 23, 2021

Benjamin Marais, Tony Quertier, Christophe Chesneau

Figure 1 for Malware Analysis with Artificial Intelligence and a Particular Attention on Results Interpretability

Figure 2 for Malware Analysis with Artificial Intelligence and a Particular Attention on Results Interpretability

Figure 3 for Malware Analysis with Artificial Intelligence and a Particular Attention on Results Interpretability

Figure 4 for Malware Analysis with Artificial Intelligence and a Particular Attention on Results Interpretability

Abstract:Malware detection and analysis are active research subjects in cybersecurity over the last years. Indeed, the development of obfuscation techniques, as packing, for example, requires special attention to detect recent variants of malware. The usual detection methods do not necessarily provide tools to interpret the results. Therefore, we propose a model based on the transformation of binary files into grayscale image, which achieves an accuracy rate of 88%. Furthermore, the proposed model can determine if a sample is packed or encrypted with a precision of 85%. It allows us to analyze results and act appropriately. Also, by applying attention mechanisms on detection models, we have the possibility to identify which part of the files looks suspicious. This kind of tool should be very useful for data analysts, it compensates for the lack of interpretability of the common detection models, and it can help to understand why some malicious files are undetected.

Via

Access Paper or Ask Questions