Abstract: Machine learning has recently shifted from a model-centric to a data-centric view. Progress across diverse learning tasks has been driven by the accumulation of larger datasets, which in turn has made it possible to train larger models. However, the datasets themselves remain relatively under-explored. To this end, we introduce RK-core, an approach for gaining a deeper understanding of the intricate hierarchical structure within datasets. Across several benchmark datasets, we find that samples with low coreness values appear less representative of their respective categories, whereas samples with high coreness values are more representative. Correspondingly, samples with high coreness values contribute more to performance than those with low coreness values. Building on this, we further use RK-core to analyze the hierarchical structure of the samples chosen by different coreset selection methods. Remarkably, we find that a high-quality coreset should exhibit hierarchical diversity rather than consist solely of representative samples. The code is available at https://github.com/yaolu-zjut/Kcore.
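To make the term "coreness" concrete, the following is a minimal sketch of classical k-core decomposition on an undirected graph. It is not the RK-core method itself, whose construction the abstract does not describe; the function name `core_numbers` and the toy edge list are hypothetical, and how samples would be linked into a graph is left open here.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: coreness} for an undirected graph given as (u, v) pairs.

    Classical peeling: repeatedly remove a node of minimum remaining degree.
    A node's coreness is the largest minimum degree seen up to its removal.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    remaining = set(adj)
    coreness = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])
        coreness[node] = k
        remaining.remove(node)
        for nbr in adj[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return coreness


# Toy graph (illustrative only): a triangle a-b-c with a pendant node d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
print(sorted(core_numbers(edges).items()))
# [('a', 2), ('b', 2), ('c', 2), ('d', 1)]
```

In this toy graph the densely connected triangle nodes receive coreness 2 and the peripheral node receives coreness 1, which mirrors the high-versus-low coreness distinction the abstract draws between samples.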
Abstract: The Border Gateway Protocol (BGP) is the default protocol for exchanging routing reachability information on the Internet, and abnormal behavior in BGP traffic is closely related to Internet anomaly events. A BGP anomaly detection model helps ensure stable routing services on the Internet through real-time monitoring and alerting. Previous studies focused either on the feature selection problem or on the memory characteristics of the data, while ignoring the relationships between features and the precise temporal correlation within each feature (whether long-term or short-term dependence). In this paper, we propose a multi-view model for capturing anomalous behaviors in BGP update traffic, in which the Seasonal and Trend decomposition using Loess (STL) method is used to reduce noise in the original time-series data, and Graph Attention Networks (GAT) are used to discover the relationships between features and the temporal correlations within each feature, respectively. Our model outperforms state-of-the-art methods on the anomaly detection task, with average F1 scores of up to 96.3% and 93.2% on the balanced and imbalanced datasets, respectively. Moreover, our model can be extended to classify multiple anomaly types and to detect unknown events.
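As a rough illustration of how a graph attention layer weights related nodes, the sketch below computes the attention coefficients of a single standard GAT head in plain Python. It is not the authors' multi-view model: the function name `gat_attention`, the toy features, the identity weight matrix `W`, and the attention vector `a` are all assumptions chosen for readability, and in a feature graph of the kind the abstract describes each node would presumably stand for one BGP feature.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_attention(h, neighbors, W, a):
    """Attention weights alpha[i][j] for one GAT head.

    h: list of node feature vectors.
    neighbors: {i: [j, ...]}, listing i itself if self-loops are wanted.
    W: shared linear map; a: attention vector of length 2 * output dim.
    Score: e_ij = LeakyReLU(a . [W h_i || W h_j]), then softmax over j.
    """
    z = [matvec(W, hi) for hi in h]
    alpha = {}
    for i, nbrs in neighbors.items():
        scores = [
            leaky_relu(sum(ak * x for ak, x in zip(a, z[i] + z[j])))
            for j in nbrs
        ]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return alpha


# Toy inputs (illustrative only): three nodes with 2-dimensional features.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 1.0, 0.0]  # scores depend only on the neighbor's first component
alpha = gat_attention(h, {0: [0, 1, 2]}, W, a)
print(alpha[0])
```

With these toy values, nodes 0 and 2 receive equal weight and node 1 receives less, and the weights for each node sum to 1 because of the softmax normalization.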