Cheng-Lin Yang

CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research

Nov 02, 2024

Sian-Yao Huang, Cheng-Lin Yang, Che-Yu Lin, Chun-Ying Huang

Abstract: This research addresses command-line embedding in cybersecurity, a field hindered by the lack of comprehensive datasets owing to privacy and regulatory concerns. We propose the first dataset of similar command lines, named CyPHER, for training and unbiased evaluation. The training set, generated using a set of large language models (LLMs), comprises 28,520 similar command-line pairs. Our testing dataset consists of 2,807 similar command-line pairs sourced from authentic command-line data. In addition, we propose a command-line embedding model named CmdCaliper, which enables the computation of semantic similarity between command lines. Performance evaluations demonstrate that the smallest version of CmdCaliper (30 million parameters) surpasses state-of-the-art (SOTA) sentence embedding models with ten times more parameters across various tasks (e.g., malicious command-line detection and similar command-line retrieval). Our study explores the feasibility of data generation using LLMs in the cybersecurity domain. Furthermore, we release our proposed command-line dataset, the embedding models' weights, and all program code to the public. This advancement paves the way for more effective command-line embedding for future researchers.
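
As an illustration of what similar command-line retrieval with embeddings can look like, here is a minimal sketch. It assumes cosine similarity as the metric, which is a common choice but is not stated in the abstract. The vectors are made-up placeholders rather than real CmdCaliper outputs, and the helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the (command, score) pair with the highest similarity to query."""
    scored = ((cmd, cosine_similarity(query, vec)) for cmd, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Placeholder embeddings (assumption): a real system would obtain these
# from an embedding model instead of hard-coding them.
embeddings = {
    "whoami /all": [0.9, 0.1, 0.0],
    "ipconfig /all": [0.1, 0.9, 0.0],
    "ping 8.8.8.8": [0.0, 0.1, 0.9],
}

# Placeholder embedding for a query command such as "whoami /priv".
query = [0.8, 0.2, 0.0]

command, score = most_similar(query, embeddings)
print(command, round(score, 3))
```

With real embeddings the same ranking step applies unchanged; only the source of the vectors differs.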

With Greater Distance Comes Worse Performance: On the Perspective of Layer Utilization and Model Generalization

Jan 28, 2022

James Wang, Cheng-Lin Yang

Abstract: Generalization of deep neural networks remains one of the main open problems in machine learning. Previous theoretical works focused on deriving tight bounds on model complexity, while empirical works revealed that neural networks exhibit double descent with respect to both training sample count and network size. In this paper, we empirically examine how different layers of neural networks contribute differently to the model; we find that early layers generally learn representations relevant to performance on both training data and testing data. In contrast, deeper layers only minimize training risk and fail to generalize well to testing or mislabeled data. We further show that the distance between the trained weights of the final layers and their initial values correlates strongly with generalization error and can serve as an indicator of overfitting. Moreover, we show evidence supporting post-training regularization by re-initializing the weights of the final layers. Our findings provide an efficient method to estimate the generalization capability of neural networks, and the insight from these quantitative results may inspire the derivation of better generalization bounds that take the internal structure of neural networks into consideration.
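
The per-layer distance between trained and initial weights can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the abstract does not name the distance metric, so Euclidean distance over flattened weights is assumed, and the layer names, weight values, and helper functions are hypothetical.

```python
import math


def layer_distance(initial, trained):
    """Euclidean distance between one layer's flattened initial and trained weights."""
    if len(initial) != len(trained):
        raise ValueError("weight lists must have the same length")
    return math.sqrt(sum((t - i) ** 2 for i, t in zip(initial, trained)))


def distances_by_layer(initial_weights, trained_weights):
    """Map each layer name to its distance from initialization."""
    return {
        name: layer_distance(initial_weights[name], trained_weights[name])
        for name in initial_weights
    }


# Toy flattened weights (assumption): real values would come from
# snapshots of a network taken before and after training.
initial_weights = {"conv1": [0.0, 0.0], "fc_final": [0.0, 0.0]}
trained_weights = {"conv1": [0.3, 0.4], "fc_final": [3.0, 4.0]}

distances = distances_by_layer(initial_weights, trained_weights)
for name, dist in distances.items():
    print(f"{name}: {dist:.2f}")
```

In this toy setup the final layer has moved much further from its initial values than the early layer, which is the kind of signal the abstract describes as an indicator of overfitting.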
