Associate Professor, Computational Data Science and Engineering Department, North Carolina A&T State University
Abstract:The sharing of large-scale transportation data is beneficial for transportation planning and policymaking. However, it also raises significant security and privacy concerns, as the data may include identifiable personal information, such as individuals' home locations. To address these concerns, synthetic data generation based on real transportation data offers a promising solution that allows privacy protection while potentially preserving data utility. Although there are various synthetic data generation techniques, they are often not tailored to the unique characteristics of transportation data, such as the inherent structure of transportation networks formed by all trips in the datasets. In this paper, we use New York City taxi data as a case study to conduct a systematic evaluation of the performance of widely used tabular data generative models. In addition to traditional metrics such as distribution similarity, coverage, and privacy preservation, we propose a novel graph-based metric tailored specifically for transportation data. This metric evaluates the similarity between real and synthetic transportation networks, providing potentially deeper insights into their structural and functional alignment. We also introduced an improved privacy metric to address the limitations of the commonly-used one. Our experimental results reveal that existing tabular data generative models often fail to perform as consistently as claimed in the literature, particularly when applied to transportation data use cases. Furthermore, our novel graph metric reveals a significant gap between synthetic and real data. This work underscores the potential need to develop generative models specifically tailored to take advantage of the unique characteristics of emerging domains, such as transportation.
Abstract:Deep learning-based discriminative classifiers, despite their remarkable success, remain vulnerable to adversarial examples that can mislead model predictions. While adversarial training can enhance robustness, it fails to address the intrinsic vulnerability stemming from the opaque nature of these black-box models. We present a deep ensemble model that combines discriminative features with generative models to achieve both high accuracy and adversarial robustness. Our approach integrates a bottom-level pre-trained discriminative network for feature extraction with a top-level generative classification network that models adversarial input distributions through a deep latent variable model. Using variational Bayes, our model achieves superior robustness against white-box adversarial attacks without adversarial training. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate our model's superior adversarial robustness. Through evaluations using counterfactual metrics and feature interaction-based metrics, we establish correlations between model interpretability and adversarial robustness. Additionally, preliminary results on Tiny-ImageNet validate our approach's scalability to more complex datasets, offering a practical solution for developing robust image classification models.
Abstract:As quantum computing continues to advance, the development of quantum-secure neural networks is crucial to prevent adversarial attacks. This paper proposes three quantum-secure design principles: (1) using post-quantum cryptography, (2) employing quantum-resistant neural network architectures, and (3) ensuring transparent and accountable development and deployment. These principles are supported by various quantum strategies, including quantum data anonymization, quantum-resistant neural networks, and quantum encryption. The paper also identifies open issues in quantum security, privacy, and trust, and recommends exploring adaptive adversarial attacks and auto adversarial attacks as future directions. The proposed design principles and recommendations provide guidance for developing quantum-secure neural networks, ensuring the integrity and reliability of machine learning models in the quantum era.
Abstract:The network of services, including delivery, farming, and environmental monitoring, has experienced exponential expansion in the past decade with Unmanned Aerial Vehicles (UAVs). Yet, UAVs are not robust enough against cyberattacks, especially on the Controller Area Network (CAN) bus. The CAN bus is a general-purpose vehicle-bus standard to enable microcontrollers and in-vehicle computers to interact, primarily connecting different Electronic Control Units (ECUs). In this study, we focus on solving some of the most critical security weaknesses in UAVs by developing a novel graph-based intrusion detection system (IDS) leveraging the Uncomplicated Application-level Vehicular Communication and Networking (UAVCAN) protocol. First, we decode CAN messages based on UAVCAN protocol specification; second, we present a comprehensive method of transforming tabular UAVCAN messages into graph structures. Lastly, we apply various graph-based machine learning models for detecting cyber-attacks on the CAN bus, including graph convolutional neural networks (GCNNs), graph attention networks (GATs), Graph Sample and Aggregate Networks (GraphSAGE), and graph structure-based transformers. Our findings show that inductive models such as GATs, GraphSAGE, and graph-based transformers can achieve competitive and even better accuracy than transductive models like GCNNs in detecting various types of intrusions, with minimum information on protocol specification, thus providing a generic robust solution for CAN bus security for the UAVs. We also compared our results with baseline single-layer Long Short-Term Memory (LSTM) and found that all our graph-based models perform better without using any decoded features based on the UAVCAN protocol, highlighting higher detection performance with protocol-independent capability.
Abstract:This study investigates crash severity risk modeling strategies for work zones involving large vehicles (i.e., trucks, buses, and vans) when there are crash data imbalance between low-severity (LS) and high-severity (HS) crashes. We utilized crash data, involving large vehicles in South Carolina work zones for the period between 2014 and 2018, which included 4 times more LS crashes compared to HS crashes. The objective of this study is to explore crash severity prediction performance of various models under different feature selection and data balancing techniques. The findings of this study highlight a disparity between LS and HS predictions, with less-accurate prediction of HS crashes compared to LS crashes due to class imbalance and feature overlaps between LS and HS crashes. Combining features from multiple feature selection techniques: statistical correlation, feature importance, recursive elimination, statistical tests, and mutual information, slightly improves HS crash prediction performance. Data balancing techniques such as NearMiss-1 and RandomUnderSampler, maximize HS recall when paired with certain prediction models, such as Bayesian Mixed Logit (BML), NeuralNet, and RandomForest, making them suitable for HS crash prediction. Conversely, RandomOverSampler, HS Class Weighting, and Kernel-based Synthetic Minority Oversampling (K-SMOTE), used with certain prediction models such as BML, CatBoost, and LightGBM, achieve a balanced performance, defined as achieving an equitable trade-off between LS and HS prediction performance metrics. These insights provide safety analysts with guidance to select models, feature selection techniques, and data balancing techniques that align with their specific safety objectives, offering a robust foundation for enhancing work-zone crash severity prediction.
Abstract:In this paper, we present an automated machine learning (AutoML) approach for network intrusion detection, leveraging a stacked ensemble model developed using the MLJAR AutoML framework. Our methodology combines multiple machine learning algorithms, including LightGBM, CatBoost, and XGBoost, to enhance detection accuracy and robustness. By automating model selection, feature engineering, and hyperparameter tuning, our approach reduces the manual overhead typically associated with traditional machine learning methods. Extensive experimentation on the NSL-KDD dataset demonstrates that the stacked ensemble model outperforms individual models, achieving high accuracy and minimizing false positives. Our findings underscore the benefits of using AutoML for network intrusion detection, as the AutoML-driven stacked ensemble achieved the highest performance with 90\% accuracy and an 89\% F1 score, outperforming individual models like Random Forest (78\% accuracy, 78\% F1 score), XGBoost and CatBoost (both 80\% accuracy, 80\% F1 score), and LightGBM (78\% accuracy, 78\% F1 score), providing a more adaptable and efficient solution for network security applications.
Abstract:The environmental impacts of global warming driven by methane (CH4) emissions have catalyzed significant research initiatives in developing novel technologies that enable proactive and rapid detection of CH4. Several data-driven machine learning (ML) models were tested to determine how well they identified fugitive CH4 and its related intensity in the affected areas. Various meteorological characteristics, including wind speed, temperature, pressure, relative humidity, water vapor, and heat flux, were included in the simulation. We used the ensemble learning method to determine the best-performing weighted ensemble ML models built upon several weaker lower-layer ML models to (i) detect the presence of CH4 as a classification problem and (ii) predict the intensity of CH4 as a regression problem.
Abstract:Wireless connections are a communication channel used to support different applications in our life such as microwave connections, mobile cellular networks, and intelligent transportation systems. The wireless communication channels are affected by different weather factors such as rain, snow, fog, dust, and sand. This effect is more evident in the high frequencies of the millimeter-wave (mm-wave) band. Recently, the 5G opened the door to support different applications with high speed and good quality. A recent study investigates the effect of rain and snow on the 5G communication channel to reduce the challenge of using high millimeter-wave frequencies. This research investigates the impact of dust and sand on the communication channel of 5G mini links using Mie scattering model to estimate the propagating wave's attenuation by computing the free space loss of a dusty region. Also, the cross-polarization of the propagating wave with dust and sand is taken into account at different distances of the propagating length. Two kinds of mini links, ML-6363, and ML-6352, are considered to demonstrate the effect of dust and sand in these specific operating frequency bands. The 73.5 GHz (V-band) and (21.5GHz (K-band) are the ML-6352 and ML-6363 radio frequency, respectively. Also, signal depolarization is another important radio frequency transmission parameter that is considered heroin. The numerical and simulation results show that the 5G ML-6352 is more effect by dust and sand than ML6363. The 5G toolbox is used to build the communication system and simulate the effect of the dust and sand on the different frequency bands.
Abstract:This paper presents Bayesian parameter estimation for first order Grey system models' parameters (or sometimes referred to as hyperparameters). There are different forms of first-order Grey System Models. These include $GM(1,1)$, $GM(1,1| \cos(\omega t)$, $GM(1,1| \sin(\omega t)$, and $GM(1,1| \cos(\omega t), \sin(\omega t)$. The whitenization equation of these models is a first-order linear differential equation of the form \[ \frac{dx}{dt} + a x = f(t) \] where $a$ is a parameter and $f(t) = b$ in $GM(1,1|)$ , $f(t) = b_1\cos(\omega t) + b_2$ in $GM(1,1| cos(\omega t)$, $f(t) = b_1\sin(\omega t)+b_2$ in $GM(1,1| \sin(\omega t)$, $f(t) = b_1\sin(\omega t) + b_2\cos(\omega t) + b_3$ in $GM(1,1| \cos(\omega t), \sin(\omega t)$, $f(t) = b x^2$ in Grey Verhulst model (GVM), and where $b, b_1, b_2$, and $b_3$ are parameters. The results from Bayesian estimations are compared to the least square estimated models with fixed $\omega$. We found that using rolling Bayesian estimations for GM parameters can allow us to estimate the parameters in all possible forms. Based on the data used for the comparison, the numerical results showed that models with Bayesian parameter estimations are up to 45\% more accurate in mean squared errors.
Abstract:The efficiency and reliability of real-time incident detection models directly impact the affected corridors' traffic safety and operational conditions. The recent emergence of cloud-based quantum computing infrastructure and innovations in noisy intermediate-scale quantum devices have revealed a new era of quantum-enhanced algorithms that can be leveraged to improve real-time incident detection accuracy. In this research, a hybrid machine learning model, which includes classical and quantum machine learning (ML) models, is developed to identify incidents using the connected vehicle (CV) data. The incident detection performance of the hybrid model is evaluated against baseline classical ML models. The framework is evaluated using data from a microsimulation tool for different incident scenarios. The results indicate that a hybrid neural network containing a 4-qubit quantum layer outperforms all other baseline models when there is a lack of training data. We have created three datasets; DS-1 with sufficient training data, and DS-2 and DS-3 with insufficient training data. The hybrid model achieves a recall of 98.9%, 98.3%, and 96.6% for DS-1, DS-2, and DS-3, respectively. For DS-2 and DS-3, the average improvement in F2-score (measures model's performance to correctly identify incidents) achieved by the hybrid model is 1.9% and 7.8%, respectively, compared to the classical models. It shows that with insufficient data, which may be common for CVs, the hybrid ML model will perform better than the classical models. With the continuing improvements of quantum computing infrastructure, the quantum ML models could be a promising alternative for CV-related applications when the available data is insufficient.