Title: Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models

Jan 10, 2023

Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, Saravan Rajmohan

Abstract: Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require a significant amount of domain knowledge and manual effort to root-cause and mitigate production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we conduct the first large-scale study to evaluate the effectiveness of these models for helping engineers root-cause and mitigate production incidents. We conduct a rigorous study at Microsoft on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned, and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.

* Accepted at the International Conference on Software Engineering (ICSE-2023)
