Mohammad Saiful Islam

Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset

Nov 13, 2024

Mohammad Saiful Islam, Mohamed Sami Rakha, William Pourmajidi, Janakan Sivaloganathan, John Steinbacher, Andriy Miranskyy

Abstract: As Large-Scale Cloud Systems (LCS) become increasingly complex, effective anomaly detection is critical for ensuring system reliability and performance. However, there is a shortage of large-scale, real-world datasets available for benchmarking anomaly detection methods. To address this gap, we introduce a new high-dimensional dataset from IBM Cloud, collected over 4.5 months from the IBM Cloud Console. This dataset comprises 39,365 rows and 117,448 columns of telemetry data. Additionally, we demonstrate the application of machine learning models for anomaly detection and discuss the key challenges faced in this process. This study and the accompanying dataset provide a resource for researchers and practitioners in cloud system monitoring. It facilitates more efficient testing of anomaly detection methods in real-world data, helping to advance the development of robust solutions to maintain the health and performance of large-scale cloud infrastructures.
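To give a feel for the scale quoted in the abstract, the short sketch below estimates the memory needed to hold the full table as a dense array. The row and column counts come from the abstract; the 8-byte (float64) cell size is an assumption, since the abstract does not state how the telemetry values are stored.

```python
# Rough dense-memory estimate for the dataset dimensions quoted in the abstract.
ROWS = 39_365        # from the abstract
COLUMNS = 117_448    # from the abstract
BYTES_PER_VALUE = 8  # assumption: every cell is stored as a float64

cells = ROWS * COLUMNS
dense_bytes = cells * BYTES_PER_VALUE
dense_gib = dense_bytes / 2**30

print(f"cells: {cells:,}")
print(f"dense float64 size: {dense_gib:.1f} GiB")
```

Under that assumption the table would not fit in the memory of a typical workstation, which is one reason column selection or chunked loading is often considered for data of this shape.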

Anomaly Detection in a Large-scale Cloud Platform

Oct 21, 2020

Mohammad Saiful Islam, William Pourmajidi, Lei Zhang, John Steinbacher, Tony Erwin, Andriy Miranskyy

Abstract: Cloud computing is ubiquitous: more and more companies are moving their workloads into the Cloud. However, this rise in popularity challenges Cloud service providers, as they need to monitor the quality of their ever-growing offerings effectively. To address the challenge, we designed and implemented an automated monitoring system for the IBM Cloud Platform. This monitoring system utilizes deep learning neural networks to detect anomalies in near-real-time in multiple Platform components simultaneously. After running the system for a year, we observed that the proposed solution frees the DevOps team's time and human resources from manually monitoring thousands of Cloud components. Moreover, it increases customer satisfaction by reducing the risk of Cloud outages. In this paper, we share our solution's architecture, implementation notes, and best practices that emerged while evolving the monitoring system. They can be leveraged by other researchers and practitioners to build anomaly detectors for complex systems.

Anomaly Detection in Cloud Components

May 18, 2020

Mohammad Saiful Islam, Andriy Miranskyy

Abstract: Cloud platforms, under the hood, consist of a complex inter-connected stack of hardware and software components. Each of these components can fail, which may lead to an outage. Our goal is to improve the quality of Cloud services through early detection of such failures by analyzing resource utilization metrics. We tested a Gated-Recurrent-Unit-based autoencoder with a likelihood function to detect anomalies in various multi-dimensional time series and achieved high performance.

* Accepted for publication in Proceedings of the IEEE International Conference on Cloud Computing (CLOUD 2020)
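The abstract pairs an autoencoder with a likelihood function but does not say which likelihood is used. As a minimal sketch of the general idea, and not the authors' method, the code below assumes a Gaussian fitted to reconstruction errors from normal data and flags new errors whose log-likelihood falls below a 3-sigma cutoff. The GRU autoencoder itself is not implemented here; the error values, the function names, and the 3-sigma threshold are all hypothetical.

```python
import math
import statistics


def fit_gaussian(errors):
    """Fit mean and standard deviation to reconstruction errors from normal data."""
    mu = statistics.fmean(errors)
    sigma = statistics.pstdev(errors)
    return mu, max(sigma, 1e-12)  # guard against a zero standard deviation


def log_likelihood(error, mu, sigma):
    """Gaussian log-density of a single reconstruction error."""
    z = (error - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))


def flag_anomalies(errors, mu, sigma, threshold):
    """Mark errors whose log-likelihood is below the threshold.

    The Gaussian is symmetric, so unusually small errors would be flagged too.
    """
    return [log_likelihood(e, mu, sigma) < threshold for e in errors]


# Hypothetical reconstruction errors an autoencoder might produce on normal data.
train_errors = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.11]
mu, sigma = fit_gaussian(train_errors)

# Assumption: anything less likely than a point 3 sigma from the mean is anomalous.
threshold = log_likelihood(mu + 3 * sigma, mu, sigma)

new_errors = [0.10, 0.12, 0.45]
flags = flag_anomalies(new_errors, mu, sigma, threshold)
print(flags)
```

Scoring by likelihood rather than by raw error lets the cutoff adapt to how noisy each metric normally is, which matters when many time series with different scales are monitored together.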
