Oak Ridge National Laboratory
Abstract:Vehicular controller area networks (CANs) are susceptible to masquerade attacks by malicious adversaries. In masquerade attacks, adversaries silence a targeted ID and then send malicious frames with forged content at the expected timing of benign frames. As masquerade attacks could seriously harm vehicle functionality and are the stealthiest attacks to detect in CAN, recent work has devoted attention to comparing frameworks for detecting masquerade attacks in CAN. However, most existing works report offline evaluations using CAN logs already collected using simulations that do not comply with the domain's real-time constraints. Here we advance the state of the art by introducing a benchmark study of four different non-deep learning (DL)-based unsupervised online intrusion detection systems (IDS) for masquerade attacks in CAN. Our approach differs from existing benchmarks in that we analyze the effect of controlling streaming data conditions in a sliding window setting. In doing so, we use realistic masquerade attacks replayed from the ROAD dataset. We show that although the benchmarked IDS are not effective at detecting every attack type, the method that relies on detecting changes in the hierarchical structure of clusters of time series produces the best results at the expense of higher computational overhead. We discuss limitations, open challenges, and how the benchmarked methods can be used for practical unsupervised online CAN IDS for masquerade attacks.
Abstract:In this pioneering work we formulate ExpM+NF, a method for training machine learning (ML) on private data with a pre-specified differential privacy guarantee $\varepsilon>0, \delta=0$, by using the Exponential Mechanism (ExpM) and an auxiliary Normalizing Flow (NF). We articulate theoretical benefits of ExpM+NF over Differentially Private Stochastic Gradient Descent (DPSGD), the state-of-the-art (SOTA) and de facto method for differentially private ML, and we empirically test ExpM+NF against DPSGD using the SOTA implementation (Opacus with PRV accounting) in multiple classification tasks on the Adult Dataset (census data) and MIMIC-III Dataset (electronic healthcare records) using Logistic Regression and GRU-D, a deep learning recurrent neural network with ~20K-100K parameters. In all experiments, ExpM+NF achieves greater than 93% of the non-private training accuracy (AUC) for $\varepsilon \in [1\mathrm{e}{-3}, 1]$, exhibiting greater accuracy (higher AUC) and privacy (lower $\varepsilon$ with $\delta=0$) than DPSGD. Differentially private ML generally considers $\varepsilon \in [1,10]$ to maintain reasonable accuracy; hence, ExpM+NF's ability to provide strong accuracy for orders of magnitude better privacy (smaller $\varepsilon$) substantially pushes what is currently possible in differentially private ML. Training time results show that ExpM+NF is comparable to (slightly faster than) DPSGD. Code for these experiments will be provided after review. Limitations and future directions are provided.
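As a rough illustration of the Exponential Mechanism named in the abstract above, the following sketch samples from a small finite set of candidates with probability proportional to $\exp(\varepsilon u / (2\Delta u))$, where $u$ is a utility score and $\Delta u$ its sensitivity. This is the textbook finite-candidate form, not the ExpM+NF method itself, which targets a continuous density over model parameters with the help of a normalizing flow. The function names, the candidate list, and the utilities are hypothetical and chosen only for illustration.

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Return selection probabilities proportional to exp(eps * u / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    # Subtract the maximum utility before exponentiating for numerical stability;
    # this does not change the normalized probabilities.
    top = max(utilities)
    weights = [math.exp(scale * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def sample_exponential_mechanism(candidates, utilities, epsilon, sensitivity, rng):
    """Draw one candidate according to the Exponential Mechanism probabilities."""
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    candidates = ["model_a", "model_b", "model_c"]  # hypothetical candidates
    utilities = [0.0, 1.0, 2.0]  # hypothetical utilities, e.g. negative loss
    print(exponential_mechanism_probs(utilities, epsilon=1.0, sensitivity=1.0))
    print(sample_exponential_mechanism(candidates, utilities, 1.0, 1.0, rng))
```

A smaller $\varepsilon$ flattens the probabilities toward uniform (more privacy, less preference for high utility), while a larger $\varepsilon$ concentrates mass on the highest-utility candidates.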
Abstract:Vehicular Controller Area Networks (CANs) are susceptible to cyber attacks of different levels of sophistication. Fabrication attacks are the easiest to administer -- an adversary simply sends (extra) frames on a CAN -- but also the easiest to detect because they disrupt frame frequency. To overcome time-based detection methods, adversaries must administer masquerade attacks by sending frames in lieu of (and therefore at the expected time of) benign frames but with malicious payloads. Research efforts have proven that CAN attacks, and masquerade attacks in particular, can affect vehicle functionality. Examples include causing unintended acceleration, deactivating the vehicle's brakes, and steering the vehicle. We hypothesize that masquerade attacks modify the nuanced correlations of CAN signal time series and how they cluster together. Therefore, changes in cluster assignments should indicate anomalous behavior. We confirm this hypothesis by leveraging our previously developed capability for reverse engineering CAN signals (i.e., CAN-D [Controller Area Network Decoder]) and focus on advancing the state of the art for detecting masquerade attacks by analyzing time series extracted from raw CAN frames. Specifically, we demonstrate that masquerade attacks can be detected by computing time series clustering similarity using hierarchical clustering on the vehicle's CAN signals (time series) and comparing the clustering similarity across CAN captures with and without attacks. We test our approach on a previously collected CAN dataset with masquerade attacks (i.e., the ROAD dataset) and develop a forensic tool as a proof of concept to demonstrate the potential of the proposed approach for detecting CAN masquerade attacks.
Abstract:Modern vehicles are complex cyber-physical systems made of hundreds of electronic control units (ECUs) that communicate over controller area networks (CANs). This inherited complexity has expanded the CAN attack surface, which is vulnerable to message injection attacks. These injections change the overall timing characteristics of messages on the bus, and thus, to detect these malicious messages, time-based intrusion detection systems (IDSs) have been proposed. However, time-based IDSs are usually trained and tested on low-fidelity datasets with unrealistic, labeled attacks. This makes it difficult to evaluate, compare, and validate IDSs. Here we detail and benchmark four time-based IDSs against the newly published ROAD dataset, the first open CAN IDS dataset with real (non-simulated) stealthy attacks with physically verified effects. We found that methods that perform hypothesis testing by explicitly estimating message timing distributions have lower performance than methods that seek anomalies in a distribution-related statistic. In particular, these "distribution-agnostic" methods outperform "distribution-based" methods by at least 55% in area under the precision-recall curve (AUC-PR). Our results expand the body of knowledge of CAN time-based IDSs by providing details of these methods and reporting their results when tested on datasets with real advanced attacks. Finally, we develop an after-market plug-in detector using lightweight hardware, which can be used to deploy the best performing IDS method on nearly any vehicle.
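To make the idea of time-based detection in the abstract above concrete, here is a minimal sketch that learns the mean and standard deviation of the inter-arrival times of one CAN ID from benign traffic and then flags windows whose mean inter-arrival time deviates strongly from that baseline. It is not one of the four benchmarked IDSs; the function names, the window size, and the threshold are assumptions made for illustration.

```python
from statistics import mean, stdev


def inter_arrival_times(timestamps):
    """Gaps between consecutive frame timestamps (seconds) for a single CAN ID."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def fit_baseline(benign_timestamps):
    """Estimate mean and sample standard deviation of benign inter-arrival times."""
    gaps = inter_arrival_times(benign_timestamps)
    return mean(gaps), stdev(gaps)


def flag_windows(timestamps, baseline, window=10, threshold=3.0):
    """Flag each non-overlapping window of gaps whose mean is far from the baseline.

    The score is a z-statistic of the window mean; it assumes the baseline
    standard deviation is non-zero (i.e., benign traffic has some jitter).
    """
    mu, sigma = baseline
    gaps = inter_arrival_times(timestamps)
    flags = []
    for start in range(0, len(gaps) - window + 1, window):
        chunk = gaps[start:start + window]
        z_score = abs(mean(chunk) - mu) / (sigma / window ** 0.5)
        flags.append(z_score > threshold)
    return flags
```

Injected frames shorten the observed gaps for the targeted ID, so the window mean drops and the score rises. A masquerade attack that preserves timing would not be caught by a check of this kind, which is the motivation for the signal-level methods described elsewhere in this list.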
Abstract:The Controller Area Network (CAN) protocol is ubiquitous in modern vehicles, but the protocol lacks many important security properties, such as message authentication. To address these insecurities, a rapidly growing field of research has emerged that seeks to detect tampering, anomalies, or attacks on these networks; this field has developed a wide variety of novel approaches and algorithms to address these problems. One major impediment to the progression of this CAN anomaly detection and intrusion detection system (IDS) research area is the lack of high-fidelity datasets with realistic labeled attacks, without which it is difficult to evaluate, compare, and validate these proposed approaches. In this work we present the first comprehensive survey of publicly available CAN intrusion datasets. Based on a thorough analysis of the data and documentation, for each dataset we provide a detailed description and enumerate the drawbacks, benefits, and suggested use cases. Our analysis is aimed at guiding researchers in finding appropriate datasets for testing a CAN IDS. We present the Real ORNL Automotive Dynamometer (ROAD) CAN Intrusion Dataset, providing the first dataset with real, advanced attacks to the existing collection of open datasets.
Abstract:There is a lack of scientific testing of commercially available malware detectors, especially those that boast accurate classification of never-before-seen (zero-day) files using machine learning (ML). The result is that the efficacy and trade-offs among the different available approaches are opaque. In this paper, we address this gap in the scientific literature with an evaluation of commercially available malware detection tools. We tested each tool against 3,536 total files (2,554, or 72%, malicious; 982, or 28%, benign), including over 400 zero-day malware samples, and tested with a variety of file types and protocols for delivery. Specifically, we investigate three questions: Do ML-based malware detectors provide better detection than signature-based detectors? Is it worth purchasing a network-level malware detector to complement host-based detection? What is the trade-off in detection time and detection accuracy among commercially available tools using static and dynamic analysis? We present statistical results on detection time and accuracy, consider complementary analysis (using multiple tools together), and provide a novel application of a recent cost-benefit evaluation procedure by Iannaconne and Bridges that incorporates all the above metrics into a single quantifiable cost to help security operation centers select the right tools for their use case. Our results show that while ML-based tools are more effective at detecting zero-days and malicious executables, they work best when used in combination with a signature-based solution. In addition, network-based tools had poor detection rates on protocols other than HTTP or SMTP, making them a poor choice if used on their own. Surprisingly, we also found that all the tools tested had lower than expected detection rates, completely missing 37% of malicious files tested and failing to detect any polyglot files.
Abstract:We present an approach to analyze $C^1(\mathbb{R}^m)$ functions that addresses limitations present in the Active Subspaces (AS) method of Constantine et al. (2015; 2014). Under appropriate hypotheses, our Active Manifolds (AM) method identifies a 1-D curve in the domain (the active manifold) on which nearly all values of the unknown function are attained, and which can be exploited for approximation or analysis, especially when $m$ is large (high-dimensional input space). We provide theorems justifying our AM technique and an algorithm permitting functional approximation and sensitivity analysis. Using accessible, low-dimensional functions as initial examples, we show AM reduces approximation error by an order of magnitude compared to AS, at the expense of more computation. Following this, we revisit the sensitivity analysis by Glaws et al. (2017), who apply AS to analyze a magnetohydrodynamic power generator model, and compare the performance of AM on the same data. Our analysis provides detailed information not captured by AS, exhibiting the influence of each parameter individually along an active manifold. Overall, AM represents a novel technique for analyzing functional models with benefits including: reducing $m$-dimensional analysis to a 1-D analogue, permitting more accurate regression than AS (at more computational expense), enabling more informative sensitivity analysis, and granting accessible visualizations (2-D plots) of parameter sensitivity along the AM.
Abstract:Modern cyber security operations collect an enormous amount of logging and alerting data. While analysts have the ability to query and compute simple statistics and plots from their data, current analytical tools are too simple to admit deep understanding. To detect advanced and novel attacks, analysts turn to manual investigations. While commonplace, current investigations are time-consuming and intuition-based, and they are proving insufficient. Our hypothesis is that arming analysts with easy-to-use data science tools will increase their work efficiency, provide them with the ability to resolve hypotheses with scientific inquiry of their data, and support their decisions with evidence over intuition. To this end, we present our work to build IDEAS (Interactive Data Exploration and Analysis System). We present three real-world use-cases that drive the system design from the algorithmic capabilities to the user interface. Finally, a modular and scalable software architecture is discussed along with plans for our pilot deployment with a security operation command.
Abstract:Scientists and engineers rely on accurate mathematical models to quantify the objects of their studies, which are often high-dimensional. Unfortunately, high-dimensional models are inherently difficult to work with, especially when observations are sparse or expensive to determine. One way to address this problem is to approximate the original model with fewer input dimensions. Our project goal was to recover a function f that takes n inputs and returns one output, where n is potentially large. For any given n-tuple, we assume that we can observe a sample of the gradient and output of the function but it is computationally expensive to do so. This project was inspired by an approach known as Active Subspaces, which works by linearly projecting to a linear subspace where the function changes most on average. Our research gives mathematical developments informing a novel algorithm for this problem. Our approach, Active Manifolds, increases accuracy by seeking nonlinear analogues that approximate the function. The benefits of our approach are the elimination of unprincipled parameter choices, guaranteed accessible visualization, and improved estimation accuracy.
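The Active Subspaces baseline mentioned in the abstract above can be sketched as follows: average the outer products of sampled gradients, then take the dominant eigenvector of that matrix as the direction along which the function changes most on average. This sketch covers only that linear baseline in its one-dimensional form, not the Active Manifolds algorithm. The function names are hypothetical, and power iteration is used here only to keep the example free of external libraries.

```python
import math


def gradient_outer_product_mean(gradients):
    """Average of g g^T over sampled gradients g (each a list of n floats)."""
    n = len(gradients[0])
    matrix = [[0.0] * n for _ in range(n)]
    for g in gradients:
        for i in range(n):
            for j in range(n):
                matrix[i][j] += g[i] * g[j] / len(gradients)
    return matrix


def dominant_direction(matrix, iterations=200):
    """Approximate the dominant eigenvector by power iteration.

    Starts from the uniform unit vector, so it assumes that vector is not
    orthogonal to the dominant eigenvector.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero matrix: no preferred direction can be found
        v = [x / norm for x in w]
    return v
```

For a function of the form f(x) = g(a . x), every gradient is a multiple of a, so the recovered direction is a (up to normalization), and projecting inputs onto it reduces the problem to one dimension.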
Abstract:This paper introduces a novel graph-analytic approach for detecting anomalies in network flow data called GraphPrints. Building on foundational network-mining techniques, our method represents time slices of traffic as a graph, then counts graphlets -- small induced subgraphs that describe local topology. By performing outlier detection on the sequence of graphlet counts, anomalous intervals of traffic are identified, and furthermore, individual IPs experiencing abnormal behavior are singled out. Initial testing of GraphPrints is performed on real network data with an implanted anomaly. Evaluation shows false positive rates bounded by 2.84% at the time-interval level and 0.05% at the IP level, with 100% true positive rates at both.
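The pipeline described in the abstract above (graph per time slice, graphlet counts, outlier detection on the count sequence) can be illustrated with a much-simplified sketch. It counts only the two connected three-node graphlets of an undirected graph (open wedges and triangles) and flags slices by a per-feature z-score against a baseline. The graphlet types, the scoring rule, and the function names are assumptions for illustration and are not taken from the GraphPrints implementation.

```python
from itertools import combinations
from statistics import mean, pstdev


def count_three_node_graphlets(edges):
    """Return (open_wedges, triangles) for an undirected graph given as edge pairs."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    wedges = 0
    triangle_hits = 0  # each triangle is seen once from each of its three nodes
    for neighbors in adjacency.values():
        for a, b in combinations(neighbors, 2):
            if b in adjacency[a]:
                triangle_hits += 1
            else:
                wedges += 1
    return wedges, triangle_hits // 3


def flag_outlier_slices(count_vectors, baseline_size, threshold=3.0):
    """Flag slices whose largest per-feature z-score exceeds the threshold.

    The first `baseline_size` vectors are assumed to be benign and are used
    to estimate each feature's mean and standard deviation.
    """
    baseline = count_vectors[:baseline_size]
    stats = []
    for k in range(len(baseline[0])):
        column = [vector[k] for vector in baseline]
        stats.append((mean(column), pstdev(column) or 1.0))  # avoid divide-by-zero
    flags = []
    for vector in count_vectors:
        score = max(abs(x - mu) / sd for x, (mu, sd) in zip(vector, stats))
        flags.append(score > threshold)
    return flags
```

A host that suddenly contacts many otherwise unconnected peers produces a star-shaped neighborhood, which inflates the open-wedge count for that slice and pushes it past the threshold.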