Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ping He

On the Security Risks of ML-based Malware Detection Systems: A Survey

May 16, 2025

Ping He, Yuhao Mao, Changjiang Li, Lorenzo Cavallaro, Ting Wang, Shouling Ji

Abstract:Malware presents a persistent threat to user privacy and data integrity. To combat this, machine learning-based (ML-based) malware detection (MD) systems have been developed. However, these systems have increasingly been attacked in recent years, undermining their effectiveness in practice. While the security risks associated with ML-based MD systems have garnered considerable attention, the majority of prior works is limited to adversarial malware examples, lacking a comprehensive analysis of practical security risks. This paper addresses this gap by utilizing the CIA principles to define the scope of security risks. We then deconstruct ML-based MD systems into distinct operational stages, thus developing a stage-based taxonomy. Utilizing this taxonomy, we summarize the technical progress and discuss the gaps in the attack and defense proposals related to the ML-based MD systems within each stage. Subsequently, we conduct two case studies, using both inter-stage and intra-stage analyses according to the stage-based taxonomy to provide new empirical insights. Based on these analyses and insights, we suggest potential future directions from both inter-stage and intra-stage perspectives.

Via

Access Paper or Ask Questions

Defending against Adversarial Malware Attacks on ML-based Android Malware Detection Systems

Jan 23, 2025

Ping He, Lorenzo Cavallaro, Shouling Ji

Abstract:Android malware presents a persistent threat to users' privacy and data integrity. To combat this, researchers have proposed machine learning-based (ML-based) Android malware detection (AMD) systems. However, adversarial Android malware attacks compromise the detection integrity of the ML-based AMD systems, raising significant concerns. Existing defenses against adversarial Android malware provide protections against feature space attacks which generate adversarial feature vectors only, leaving protection against realistic threats from problem space attacks which generate real adversarial malware an open problem. In this paper, we address this gap by proposing ADD, a practical adversarial Android malware defense framework designed as a plug-in to enhance the adversarial robustness of the ML-based AMD systems against problem space attacks. Our extensive evaluation across various ML-based AMD systems demonstrates that ADD is effective against state-of-the-art problem space adversarial Android malware attacks. Additionally, ADD shows the defense effectiveness in enhancing the adversarial robustness of real-world antivirus solutions.

Via

Access Paper or Ask Questions

Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents

Nov 14, 2024

Yuyou Gan, Yong Yang, Zhe Ma, Ping He, Rui Zeng, Yiming Wang, Qingming Li, Chunyi Zhou, Songze Li, Ting Wang(+3 more)

Abstract:With the continuous development of large language models (LLMs), transformer-based models have made groundbreaking advances in numerous natural language processing (NLP) tasks, leading to the emergence of a series of agents that use LLMs as their control hub. While LLMs have achieved success in various tasks, they face numerous security and privacy threats, which become even more severe in the agent scenarios. To enhance the reliability of LLM-based applications, a range of research has emerged to assess and mitigate these risks from different perspectives. To help researchers gain a comprehensive understanding of various risks, this survey collects and analyzes the different threats faced by these agents. To address the challenges posed by previous taxonomies in handling cross-module and cross-stage threats, we propose a novel taxonomy framework based on the sources and impacts. Additionally, we identify six key features of LLM-based agents, based on which we summarize the current research progress and analyze their limitations. Subsequently, we select four representative agents as case studies to analyze the risks they may face in practical use. Finally, based on the aforementioned analyses, we propose future research directions from the perspectives of data, methodology, and policy, respectively.

Via

Access Paper or Ask Questions

Efficient Query-Based Attack against ML-Based Android Malware Detection under Zero Knowledge Setting

Sep 06, 2023

Ping He, Yifan Xia, Xuhong Zhang, Shouling Ji

Abstract:The widespread adoption of the Android operating system has made malicious Android applications an appealing target for attackers. Machine learning-based (ML-based) Android malware detection (AMD) methods are crucial in addressing this problem; however, their vulnerability to adversarial examples raises concerns. Current attacks against ML-based AMD methods demonstrate remarkable performance but rely on strong assumptions that may not be realistic in real-world scenarios, e.g., the knowledge requirements about feature space, model parameters, and training dataset. To address this limitation, we introduce AdvDroidZero, an efficient query-based attack framework against ML-based AMD methods that operates under the zero knowledge setting. Our extensive evaluation shows that AdvDroidZero is effective against various mainstream ML-based AMD methods, in particular, state-of-the-art such methods and real-world antivirus solutions.

* To Appear in the ACM Conference on Computer and Communications Security, November, 2023

Via

Access Paper or Ask Questions

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Aug 23, 2023

Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal

Abstract:We present a novel defense, against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor functionality of a given backdoored model to a backdoor expert model. The approach is straightforward -- finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, and thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 16 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer).

Via

Access Paper or Ask Questions

Analysis and classification of main risk factors causing stroke in Shanxi Province

May 29, 2021

Junjie Liu, Yiyang Sun, Jing Ma, Jiachen Tu, Yuhui Deng, Ping He, Huaxiong Huang, Xiaoshuang Zhou, Shixin Xu

Figure 1 for Analysis and classification of main risk factors causing stroke in Shanxi Province

Figure 2 for Analysis and classification of main risk factors causing stroke in Shanxi Province

Figure 3 for Analysis and classification of main risk factors causing stroke in Shanxi Province

Figure 4 for Analysis and classification of main risk factors causing stroke in Shanxi Province

Abstract:In China, stroke is the first leading cause of death in recent years. It is a major cause of long-term physical and cognitive impairment, which bring great pressure on the National Public Health System. Evaluation of the risk of getting stroke is important for the prevention and treatment of stroke in China. A data set with 2000 hospitalized stroke patients in 2018 and 27583 residents during the year 2017 to 2020 is analyzed in this study. Due to data incompleteness, inconsistency, and non-structured formats, missing values in the raw data are filled with -1 as an abnormal class. With the cleaned features, three models on risk levels of getting stroke are built by using machine learning methods. The importance of "8+2" factors from China National Stroke Prevention Project (CSPP) is evaluated via decision tree and random forest models. Except for "8+2" factors the importance of features and SHAP1 values for lifestyle information, demographic information, and medical measurement are evaluated and ranked via a random forest model. Furthermore, a logistic regression model is applied to evaluate the probability of getting stroke for different risk levels. Based on the census data in both communities and hospitals from Shanxi Province, we investigate different risk factors of getting stroke and their ranking with interpretable machine learning models. The results show that Hypertension (Systolic blood pressure, Diastolic blood pressure), Physical Inactivity (Lack of sports), and Overweight (BMI) are ranked as the top three high-risk factors of getting stroke in Shanxi province. The probability of getting stroke for a person can also be predicted via our machine learning model.

* 13 pages, 9 figures

Via

Access Paper or Ask Questions

Enriching Medcial Terminology Knowledge Bases via Pre-trained Language Model and Graph Convolutional Network

Sep 02, 2019

Jiaying Zhang, Zhixing Zhang, Huanhuan Zhang, Zhiyuan Ma, Yangming Zhou, Ping He

Figure 1 for Enriching Medcial Terminology Knowledge Bases via Pre-trained Language Model and Graph Convolutional Network

Figure 2 for Enriching Medcial Terminology Knowledge Bases via Pre-trained Language Model and Graph Convolutional Network

Figure 3 for Enriching Medcial Terminology Knowledge Bases via Pre-trained Language Model and Graph Convolutional Network

Figure 4 for Enriching Medcial Terminology Knowledge Bases via Pre-trained Language Model and Graph Convolutional Network

Abstract:Enriching existing medical terminology knowledge bases (KBs) is an important and never-ending work for clinical research because new terminology alias may be continually added and standard terminologies may be newly renamed. In this paper, we propose a novel automatic terminology enriching approach to supplement a set of terminologies to KBs. Specifically, terminology and entity characters are first fed into pre-trained language model to obtain semantic embedding. The pre-trained model is used again to initialize the terminology and entity representations, then they are further embedded through graph convolutional network to gain structure embedding. Afterwards, both semantic and structure embeddings are combined to measure the relevancy between the terminology and the entity. Finally, the optimal alignment is achieved based on the order of relevancy between the terminology and all the entities in the KB. Experimental results on clinical indicator terminology KB, collected from 38 top-class hospitals of Shanghai Hospital Development Center, show that our proposed approach outperforms baseline methods and can effectively enrich the KB.

* 8 pages, submitted to BIBM 2019

Via

Access Paper or Ask Questions

Active Learning for Chinese Word Segmentation in Medical Text

Aug 22, 2019

Tingting Cai, Yangming Zhou, Zhiyuan Ma, Hong Zheng, Lingfei Zhang, Ping He, Ju Gao

Figure 1 for Active Learning for Chinese Word Segmentation in Medical Text

Figure 2 for Active Learning for Chinese Word Segmentation in Medical Text

Figure 3 for Active Learning for Chinese Word Segmentation in Medical Text

Figure 4 for Active Learning for Chinese Word Segmentation in Medical Text

Abstract:Electronic health records (EHRs) stored in hospital information systems completely reflect the patients' diagnosis and treatment processes, which are essential to clinical data mining. Chinese word segmentation (CWS) is a fundamental and important task for Chinese natural language processing. Currently, most state-of-the-art CWS methods greatly depend on large-scale manually-annotated data, which is a very time-consuming and expensive work, specially for the annotation in medical field. In this paper, we present an active learning method for CWS in medical text. To effectively utilize complete segmentation history, a new scoring model in sampling strategy is proposed, which combines information entropy with neural network. Besides, to capture interactions between adjacent characters, K-means clustering features are additionally added in word segmenter. We experimentally evaluate our proposed CWS method in medical text, experimental results based on EHRs collected from the Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine show that our proposed method outperforms other reference methods, which can effectively save the cost of manual annotation.

* 8 pages, 7 figures, submitted to BIBM 2019

Via

Access Paper or Ask Questions

Fine-tuning BERT for Joint Entity and Relation Extraction in Chinese Medical Text

Aug 21, 2019

Kui Xue, Yangming Zhou, Zhiyuan Ma, Tong Ruan, Huanhuan Zhang, Ping He

Figure 1 for Fine-tuning BERT for Joint Entity and Relation Extraction in Chinese Medical Text

Figure 2 for Fine-tuning BERT for Joint Entity and Relation Extraction in Chinese Medical Text

Figure 3 for Fine-tuning BERT for Joint Entity and Relation Extraction in Chinese Medical Text

Figure 4 for Fine-tuning BERT for Joint Entity and Relation Extraction in Chinese Medical Text

Abstract:Entity and relation extraction is the necessary step in structuring medical text. However, the feature extraction ability of the bidirectional long short term memory network in the existing model does not achieve the best effect. At the same time, the language model has achieved excellent results in more and more natural language processing tasks. In this paper, we present a focused attention model for the joint entity and relation extraction task. Our model integrates well-known BERT language model into joint learning through dynamic range attention mechanism, thus improving the feature representation ability of shared parameter layer. Experimental results on coronary angiography texts collected from Shuguang Hospital show that the F1-score of named entity recognition and relation classification tasks reach 96.89% and 88.51%, which are better than state-of-the-art methods 1.65% and 1.22%, respectively.

* 8 pages, 2 figures, submitted to BIBM 2019

Via

Access Paper or Ask Questions

An attention-based Bi-GRU-CapsNet model for hypernymy detection between compound entities

May 18, 2018

Qi Wang, Tong Ruan, Yangming Zhou, Chenming Xu, Daqi Gao, Ping He

Figure 1 for An attention-based Bi-GRU-CapsNet model for hypernymy detection between compound entities

Figure 2 for An attention-based Bi-GRU-CapsNet model for hypernymy detection between compound entities

Figure 3 for An attention-based Bi-GRU-CapsNet model for hypernymy detection between compound entities

Figure 4 for An attention-based Bi-GRU-CapsNet model for hypernymy detection between compound entities

Abstract:Named entities which composed of multiple continuous words frequently occur in domain-specific knowledge graphs. These entities are usually composable and extensible. Typical examples are names of symptoms and diseases in medical areas. To distinguish these entities from general entities, we name them compound entities. Hypernymy detection between compound entities plays an important role in domain-specific knowledge graph construction. Traditional hypernymy detection approaches cannot perform well on compound entities due to the lack of contextual information in texts, and even the absence of compound entities in training sets, i.e. Out-Of-Vocabulary (OOV) problem. In this paper, we present a novel attention-based Bi-GRU-CapsNet model to detect hypernymy relationship between compound entities. Our model consists of several important components. To avoid the OOV problem, English words or Chinese characters in compound entities are fed into Bidirectional Gated Recurrent Units (Bi-GRUs). An attention mechanism is designed to focus on the differences between two compound entities. Since there are some different cases in hypernymy relationship between compound entities, Capsule Network (CapsNet) is finally employed to decide whether the hypernymy relationship exists or not. Experimental results demonstrate the advantages of our model over the state-of-the-art methods both on English and Chinese corpora of symptom and disease pairs.

* 21 pages, 3 figures. This is an improved version of our paper originally submitted to conference IJCAI-2018

Via

Access Paper or Ask Questions