Abstract:Natural language processing (NLP) has grown significantly since the advent of the Transformer architecture. Transformers have given birth to pre-trained large language models (PLMs). There has been tremendous improvement in the performance of NLP systems across several tasks. NLP systems are on par or, in some cases, better than humans at accomplishing specific tasks. However, it remains the norm that \emph{better quality datasets at the time of pretraining enable PLMs to achieve better performance, regardless of the task.} The need to have quality datasets has prompted NLP researchers to continue creating new datasets to satisfy particular needs. For example, the two top NLP conferences, ACL and EMNLP, accepted ninety-two papers in 2022, introducing new datasets. This work aims to uncover the trends and insights mined within these datasets. Moreover, we provide valuable suggestions to researchers interested in curating datasets in the future.
Abstract:With the rapidly spreading usage of Internet of Things (IoT) devices, a network intrusion detection system (NIDS) plays an important role in detecting and protecting various types of attacks in the IoT network. To evaluate the robustness of the NIDS in the IoT network, the existing work proposed a realistic botnet dataset in the IoT network (Bot-IoT dataset) and applied it to machine learning-based anomaly detection. This dataset contains imbalanced normal and attack packets because the number of normal packets is much smaller than that of attack ones. The nature of imbalanced data may make it difficult to identify the minority class correctly. In this thesis, to address the class imbalance problem in the Bot-IoT dataset, we propose a binary classification method with synthetic minority over-sampling techniques (SMOTE). The proposed classifier aims to detect attack packets and overcome the class imbalance problem using the SMOTE algorithm. Through numerical results, we demonstrate the proposed classifier's fundamental characteristics and the impact of imbalanced data on its performance.
Abstract:Natural language processing (NLP) practitioners are leveraging large language models (LLM) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and through in-context learning, we distilled knowledge from GPT-4. In effect, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 of four classes of endangered species, 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Eventually, our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. The constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT, because GPT-4 is resource intensive. Experiments show that our knowledge transfer approach is effective at creating a NER model suitable for detecting endangered species from texts.