Identifying Anomalies in Microservices Architecture - An Overview of DevOps and SRE Tools

Hello. Today we would like to talk about detecting anomalies in a microservice environment. This post is a short summary of the 40-minute talk we gave at the DevOps Live 2020 online conference. To keep it from turning into a long read, we focus on an overview of tools that detect anomalies in the distribution of metric values, help automate microservice monitoring, and can be adopted quickly by any team.

Anomaly detection is highly relevant today: with the move to microservices, SRE and DevOps teams give much higher priority to turning alerts into a meaningful signal, reducing MTTD (mean time to detect), and simplifying alert configuration when monitoring distributed environments.

latency ;

c ;
APM ;
as a Service.

, Python R.

Prometheus , time series .

recording rules, , .

, , , ( " ").

, , z- (z-score) — , , .

As an example, take the http_requests_total metric and record its request rate:

# Request rate per application over a 5-minute window
- record: job:http_requests:rate5m
  expr: sum by (app) (rate(http_requests_total[5m]))

Then compute the mean and standard deviation of that rate over the past week:

# average: mean request rate over the past week
- record: job:http_requests:rate5m:avg_over_time_1w
  expr: avg_over_time(job:http_requests:rate5m[1w])

# stddev: standard deviation of the request rate over the past week
- record: job:http_requests:rate5m:stddev_over_time_1w
  expr: stddev_over_time(job:http_requests:rate5m[1w])

# z-score: distance of the current rate from the weekly mean, in standard deviations
(job:http_requests:rate5m - job:http_requests:rate5m:avg_over_time_1w)
  / job:http_requests:rate5m:stddev_over_time_1w
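The z-score expression above can also be illustrated outside Prometheus. The following is a minimal Python sketch, not the authors' implementation: the function names, the sample request rates, and the threshold of 3 standard deviations are assumptions chosen for illustration. It uses the population standard deviation, which is also what Prometheus' stddev_over_time computes.

```python
from statistics import fmean, pstdev


def z_score(history, current):
    """Return how many standard deviations `current` lies from the mean of `history`."""
    mean = fmean(history)
    std = pstdev(history)  # population standard deviation
    if std == 0:
        # The z-score is undefined for a flat history; this sketch treats it as "no signal".
        return 0.0
    return (current - mean) / std


def is_anomaly(history, current, threshold=3.0):
    """Flag `current` when its absolute z-score exceeds `threshold` (an assumed default)."""
    return abs(z_score(history, current)) > threshold


# Hypothetical request rates (requests per second) observed over past intervals.
history = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_anomaly(history, 101))  # a value close to the mean
print(is_anomaly(history, 130))  # a value far above the mean
```

A fixed threshold like this assumes the metric is roughly normally distributed; for strongly skewed or seasonal metrics it tends to produce false alerts.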

Simple anomaly

( , latency) — , , .

, — z-.

Seasonal prediction

recording rules Prometheus .
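One common way to build a seasonal prediction is to average the values observed at the same point of earlier periods, for example the same hour in previous weeks. The sketch below illustrates that idea in Python. It is an assumption about what such a prediction can look like, not a reconstruction of the rules used in the talk; the function name, the period length, and the sample data are hypothetical.

```python
from statistics import fmean


def seasonal_prediction(series, period, seasons=3):
    """Predict the next value as the mean of the values seen at the same
    phase in up to `seasons` previous periods."""
    n = len(series)  # index of the value being predicted
    samples = [
        series[n - k * period]
        for k in range(1, seasons + 1)
        if n - k * period >= 0
    ]
    if not samples:
        raise ValueError("need at least one full period of history")
    return fmean(samples)


# Hypothetical metric with a period of 4 samples, three full periods of history.
series = [10, 20, 30, 20,
          12, 22, 32, 22,
          11, 21, 31, 21]

predicted = seasonal_prediction(series, period=4)
print(predicted)  # mean of the first sample of each previous period
```

The observed value can then be compared with the prediction, for instance with a z-score over the residuals, so that a regular daily or weekly peak is not reported as an anomaly.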

Prometheus — PAD

Prometheus Anomaly Detector (PAD), Red Hat, , .

PAD reads metrics from Prometheus. Unlike recording rules, PAD forecasts the expected metric values with models such as Prophet.

PAD architecture

PAD Grafana .

, proof of concept.

APM

(Application Performance Monitoring) AIOps — , , .

New Relic

New Relic baseline ( ) — , EUM, .

— baseline, ( , , ).

, , , , baseline.

New Relic - setting the policy for alert on deviation from the baseline

2020 — New Relic Applied Intelligence (AI).

New Relic AI KPI .

New Relic Applied Intelligence - Detecting anomalies in metrics across multiple applications

AppDynamics

AppDynamics APM baseline KPI- .

baseline , , (, ) , baseline.

AppDynamics - baseline setting

, , health rule .

, baseline health rule.

AppDynamics - setting policy for alert on deviation from baseline

Dynatrace

Dynatrace " " , .

Dynatrace - signal of traffic decrease

:

KPI

Dynatrace - setup

Instana

Instana " " 230 "" , KPI .

latency, error rate, traffic.

Instana - a list of rules that use the EDM algorithm to detect anomalies

E-Divisive with Medians (EDM).

Instana - the rule has detected an anomaly in the metric

, , baseline.

"" "" , .

baseline — .

EUM.

Instana - Alerting policy constructor based on EUM baseline metrics

Anomaly detection as a Service

APM , Prometheus , , SaaS .

Azure Metrics Advisor

Azure Metrics Advisor is a service from Microsoft.

, , e-commerce.

(SQL Server, ElasticSearch, InfluxDB, MongoDB, MySQL, PostgreSQL ), Prometheus .

Azure Metrics Advisor interface

Anodot

— Prometheues -.

-, SRE .

e-commerce, gaming .

Anodot

AnomalyIO

, , , , InfluxDB.

, InfluxDB, , .

Prometheus — .
APM AIOps, .
