Identifying Anomalies in Microservices Architecture - An Overview of DevOps and SRE Tools

Hello. Today we would like to talk about detecting anomalies in a microservice environment. This post is a short summary of the 40-minute talk we gave at the DevOps Live 2020 online conference. To keep it from turning into a long read, we focus on an overview of tools that detect anomalies in the distribution of metric values, that help automate microservice monitoring, and that any team can put to use quickly.







Anomaly detection is highly relevant right now: with the move to microservices, SRE and DevOps teams give much higher priority to turning alerts into a meaningful signal, reducing MTTD (mean time to detect), and simplifying alert configuration when monitoring distributed environments.













, , , .

"" .







, , .







?

?







, :







  • latency ;
  • ;
  • .


"" , - , .







, :







  • ;
  • , ;
  • «» , .


, , , ?







:







  • c ;
  • APM ;
  • as a Service.


.









, Python R.







Prometheus , time series .

recording rules, , .







, , , ( " ").







, , z- (z-score) - , , .







http_requests_total, :







# Per-app HTTP request rate, averaged over a 5-minute window
- record: job:http_requests:rate5m
  expr: sum by (app) (rate(http_requests_total[5m]))
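
For context, a recording rule like this one only takes effect inside a rule group in a rules file. The following is a minimal sketch of such a file; the file name, the group name and the evaluation interval are assumptions made for illustration and do not come from the original talk.

```yaml
# rules/anomaly.yml (hypothetical file name)
groups:
  - name: http_anomaly_detection   # hypothetical group name
    interval: 1m                   # assumption: evaluate the group once per minute
    rules:
      # Per-app HTTP request rate, averaged over a 5-minute window
      - record: job:http_requests:rate5m
        expr: sum by (app) (rate(http_requests_total[5m]))
```

Such a file is typically listed under `rule_files` in `prometheus.yml` and can be validated with `promtool check rules rules/anomaly.yml` before Prometheus is reloaded.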

      
      





:







# Weekly average of the request rate
- record: job:http_requests:rate5m:avg_over_time_1w
  expr: avg_over_time(job:http_requests:rate5m[1w])

# Weekly standard deviation of the request rate
- record: job:http_requests:rate5m:stddev_over_time_1w
  expr: stddev_over_time(job:http_requests:rate5m[1w])

# z-score: how many standard deviations the current rate is from the weekly average
(job:http_requests:rate5m - job:http_requests:rate5m:avg_over_time_1w)
  / job:http_requests:rate5m:stddev_over_time_1w
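
A z-score expression of this kind can also drive an alert. The sketch below shows one possible alerting rule; the alert name, the group name, the threshold of 3 standard deviations, the `for` duration and the severity label are all assumptions chosen for illustration, not values from the original talk.

```yaml
groups:
  - name: http_anomaly_alerts            # hypothetical group name
    rules:
      - alert: HttpRequestRateAnomaly    # hypothetical alert name
        # Fire when the current rate is more than 3 standard deviations
        # (assumed threshold) away from the weekly average, in either direction.
        expr: |
          abs(
            (job:http_requests:rate5m - job:http_requests:rate5m:avg_over_time_1w)
            / job:http_requests:rate5m:stddev_over_time_1w
          ) > 3
        for: 10m                         # assumption: require 10 minutes of sustained deviation
        labels:
          severity: warning              # assumed label
        annotations:
          summary: "Request rate for {{ $labels.app }} deviates strongly from its weekly average"
```

Two caveats apply to this sketch. If the weekly standard deviation is zero (a perfectly flat series), the division yields NaN or infinity, so such series may need to be filtered out. The threshold also only makes sense if the metric is roughly normally distributed.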
      
      





Simple anomaly







( , latency) - , , .







β€” .







, .







.







, - z-.







Seasonal prediction







recording rules Prometheus .







Prometheus - PAD



Prometheus Anomaly Detector (PAD), Red Hat, , .







PAD Prometheus , PAD recording rules, , , Prophet, .







PAD architecture







PAD Grafana .







PAD architecture







, proof of concept.







APM



(Application Performance Monitoring) AIOps - , , .







, .







New Relic



New Relic baseline ( ) - , EUM, .







- baseline, ( , , ).

, , , , baseline.







, .







New Relic - setting the policy for alert on deviation from the baseline







2020 - New Relic Applied Intelligence (AI).







New Relic AI KPI .







/ .







New Relic Applied Intelligence - Detecting anomalies in metrics across multiple applications







AppDynamics



AppDynamics APM baseline KPI- .







baseline , , (, ) , baseline.







AppDynamics - baseline setting







, , health rule .







, baseline health rule.







AppDynamics - setting policy for alert on deviation from baseline







Dynatrace



Dynatrace " " , .







Dynatrace - signal of traffic decrease







:







  • KPI


.







Dynatrace - setup







Dynatrace - setup







Instana



Instana " " 230 "" , KPI .







latency, error rate, traffic ( ).







Instana - a list of rules that use the EDM algorithm to detect anomalies







E-Divisive with Medians (EDM).







Instana - the rule has detected an anomaly in the metric







, , baseline.

"" "" , .







baseline - .







EUM.







Instana - Alerting policy constructor based on EUM baseline metrics







as a Service



APM , Prometheus , , SaaS .







Azure Metrics Advisor



Microsoft - Azure Metrics Advisor .







, , e-commerce.

(SQL Server, ElasticSearch, InfluxDB, MongoDB, MySQL, PostgreSQL ), Prometheus .







Azure Metrics Advisor interface







Anodot



- Prometheus -.







-, SRE .







e-commerce, gaming .







Anodot







AnomalyIO



, , , , InfluxDB.







, InfluxDB, , .







Anodot









  • .
  • – , .
  • Prometheus - .
  • APM AIOps, .


.







