Is ML really useful for reducing alert noise? A study of one method



Background



Over the last couple of years, the monitoring systems market has been stirred up by the acronym AIOps. Every vendor has started pushing artificial intelligence into its complex and expensive systems. Terms such as “root cause analysis”, “correlation”, “ML tools”, “anomaly detection”, “incident prediction” and “noise reduction” have settled, probably for good, into the marketing materials and websites of various monitoring systems.



As we know, marketing brochures are one thing and day-to-day engineering is another. Many of us have probably seen vendors' promises about some technological innovation collide, like the Titanic with an iceberg, with the reality of implementation, especially in the complex IT environments of large companies. So at first I viewed the topic with great skepticism and did not share the excitement around it, all the more so when there are rock-solid solutions such as Zabbix, Prometheus and Elastic. But hype is hype and skepticism is skepticism; we are still engineers, and we should test and study everything in practice rather than debate whether to believe in the “magic button” offered by eminent vendors and promising startups. So, after yet another integrator presentation promising, for a large sum of money, "heaven on earth" for us field engineers, we gathered a small initiative group that decided to get a hands-on feel for what this magic of artificial intelligence and machine learning amounts to in our practice. That is how these materials, and even a small pet project, came about, and I would like to share them with you.





— , . . - . : -. — “ ”, .. , “ ”, . — “ ”.



ML- . , . - , .



. HTTP- . “”, . , downdetector , , , ;)







2020-10-14 14:00 +03:00 38 ( ), .. [2020-10-12 23:00:00 +03:00 – 2020-10-14 14:00 +03:00]. : 3612.



(threshold), , 0, 1, 179 . (. . 1: . UTC. ,

).



Fig. 1. 1. . UTC. , — .



, 3- , 44 (. . 2). 4 . “0110010011101010…”, , , % ( 1 ), - .



Fig. 2. 2. 3- . , — .
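The difference between alerting on every failed check and alerting only after several failures in a row can be illustrated with a short Python sketch. It assumes that each check result is encoded as 0 (passed) or 1 (failed) and that a rule raises one alert when a run of failures reaches a configurable length. The encoding, the function name `count_alerts` and the choice of three consecutive failures are illustrative assumptions, not details taken from the pet project; the sample string is just the visible prefix of the sequence quoted above.

```python
def count_alerts(results: str, min_consecutive: int) -> int:
    """Count alerts raised when a run of failed checks reaches min_consecutive.

    `results` is a string of '0' (check passed) and '1' (check failed);
    this encoding is an assumption made for the example.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    alerts = 0
    run = 0  # length of the current run of failed checks
    for r in results:
        if r == "1":
            run += 1
            if run == min_consecutive:
                alerts += 1  # alert once per qualifying run
        else:
            run = 0  # a passed check resets the run
    return alerts


sample = "0110010011101010"
every_failure = count_alerts(sample, 1)   # alert on each new run of failures
three_in_a_row = count_alerts(sample, 3)  # alert only after 3 failures in a row
print(every_failure, three_in_a_row)
```

Requiring a longer run suppresses short flapping, at the cost of reacting later to a real outage.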



“” : - , . - , . , AI/ML.



ML?



, , Data Scientist . , , -, , 3- :



  1. . — , .
  2. , , , .
  3. , , "" . .. " " , , .


DetectIidSpike ML.NET. : . , . "" , . .

DetectIidSpike :



  • confidence: the confidence for spike detection, in the range [0, 100];
  • pvalueHistoryLength: the size of the sliding window used to compute the p-value.
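The role of these two parameters can be illustrated with a simplified Python sketch. This is not ML.NET's implementation: it assumes a normal approximation over the last `history_length` values to obtain a p-value, and flags a point when that p-value falls below `1 - confidence / 100`. The function name `detect_spikes` and the sample response times are hypothetical.

```python
import math
from statistics import mean, pstdev


def detect_spikes(values, confidence=95.0, history_length=5):
    """Flag points that are unlikely given the preceding window of values.

    Simplified stand-in for an i.i.d. spike detector: the p-value comes
    from a normal approximation, which is an assumption of this sketch.
    """
    alpha = 1.0 - confidence / 100.0  # significance level
    flags = []
    for i, x in enumerate(values):
        window = values[max(0, i - history_length):i]
        if len(window) < history_length:
            flags.append(False)  # not enough history yet
            continue
        mu = mean(window)
        sigma = pstdev(window)
        if sigma == 0:
            # Flat history: any deviation at all counts as a spike.
            flags.append(x != mu)
            continue
        z = abs(x - mu) / sigma
        p_value = math.erfc(z / math.sqrt(2))  # two-sided normal tail
        flags.append(p_value < alpha)
    return flags


# Hypothetical response times in milliseconds with one obvious outlier.
response_ms = [100, 102, 98, 101, 99, 100, 500, 101, 100]
flags = detect_spikes(response_ms, confidence=95.0, history_length=5)
spike_indexes = [i for i, is_spike in enumerate(flags) if is_spike]
print(spike_indexes)
```

In this sketch a higher confidence demands a smaller p-value before a point is flagged, while the window length controls how much recent history defines "normal" behaviour.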


, . HTTP- , .. . . - . , .. 5 : . , , .. . (, ), "", .



“”. , , , (), «» ( ). 5 . , websockets , . , ( kubernetes ).



(confidence: 95, pvalueHistoryLength: 5), 36 . , , .. . , 24 . (, ).



Fig. 3. (confidence: 95, pvalueHistoryLength: 5)



(. 3), , . , , ( ).



. 4 pvalueHistoryLength=12 confidence: 98. : 14 .



Fig. 4. (confidence: 98, pvalueHistoryLength: 12)





, DetectIidSpike (24 44) 3 , 7,5 (24 179) . , , . , ML . , :)



P.S.: ML, -, . .



P.P.S.: Below are a few more screenshots from our pet project, with real data from the checks performed and the anomalies detected. You can judge for yourself how well or poorly the algorithm works (the yellow circles mark anomalies in the selected interval).



Some more interesting screenshots







