Let's discuss monitoring

For the fourth year now, I have been working on what is commonly called Observability. I have put the experience gained during this time into writing and share it with you as a mix of reflection and recommendation. There will be almost no technical details: the article is deliberately written so that what it says can be applied to almost any technology stack. Tools come into fashion and fall out of it at incredible speed, so the choice of tools is yours. Let's discuss monitoring.





About monitoring in the context of metrics

If you ask the average engineer what they associate monitoring with, they will most likely answer "application metrics", meaning their collection and some kind of visualization. Moreover, in my experience, many do not even think about the inner workings of this process: for most people, "it is simply shown in Grafana / Kibana / Zabbix / substitute whatever you use."





This answer is incomplete, because metrics are not the whole story. More precisely: monitoring is not only about collecting metrics and displaying them on a dashboard. Let's take a closer look.





What is monitoring made of?

Over time, I arrived at the following aspects:





  1. Collecting metrics from various sources: applications, host indicators, and the hardware side of the site. The differences between the pull and push models are covered later





  2. Writing the metrics to a database and storing them there, taking into account the specifics of the database itself and how the collected data will be used





  3. Visualizing metrics, which has to balance the capabilities of the chosen technology stack, the usability of the dashboards, and the wishes of those who will have to work with it all





  4. Checking metric values against specified rules and sending alerts





  5. In advanced monitoring, one more item can be added: ML-based anomaly detection and proactive reporting of degradation in the observed system.
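To make the first, second and fourth items a little more concrete, here is a minimal, stack-agnostic sketch in Python. Everything in it is an assumption made for illustration: `MetricStore`, `Rule`, `collect` and `evaluate` are hypothetical names, the "database" is an in-memory dictionary, and the only rule type is "every sample in the window is above a threshold". Visualization and anomaly detection are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


@dataclass
class MetricStore:
    """Item 2: a toy in-memory stand-in for a metrics database."""
    series: Dict[str, List[Sample]] = field(default_factory=dict)

    def write(self, name: str, timestamp: float, value: float) -> None:
        self.series.setdefault(name, []).append((timestamp, value))

    def window(self, name: str, since: float) -> List[float]:
        """Return the values of one metric recorded at or after `since`."""
        return [v for t, v in self.series.get(name, []) if t >= since]


@dataclass
class Rule:
    """Item 4: fires when every sample in the window exceeds the threshold."""
    metric: str
    threshold: float
    window_seconds: float

    def fires(self, store: MetricStore, now: float) -> bool:
        values = store.window(self.metric, now - self.window_seconds)
        # An empty window never fires: no data is not the same as bad data.
        return bool(values) and all(v > self.threshold for v in values)


def collect(store: MetricStore,
            sources: Dict[str, Callable[[], float]],
            now: float) -> None:
    """Item 1: read every source once and record the reading."""
    for name, read in sources.items():
        store.write(name, now, read())


def evaluate(store: MetricStore, rules: List[Rule], now: float) -> List[str]:
    """Item 4: return one alert message per rule that fires."""
    return [
        f"ALERT: {r.metric} above {r.threshold} for {r.window_seconds}s"
        for r in rules
        if r.fires(store, now)
    ]


# Hypothetical usage: three CPU readings taken ten seconds apart.
store = MetricStore()
readings = iter([95.0, 97.0, 99.0])
sources = {"cpu_percent": lambda: next(readings)}
for now in (0.0, 10.0, 20.0):
    collect(store, sources, now)

rules = [Rule("cpu_percent", threshold=90.0, window_seconds=30.0)]
alerts = evaluate(store, rules, now=20.0)
print(alerts)
```

Requiring the whole window to be above the threshold, rather than a single sample, is one simple way to keep short spikes from producing alerts; a real system would take the rule language and storage from the chosen stack.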





About collecting metrics

So, you've decided to create a monitoring system. The first step is to think about what metrics to collect:





  • - «» - CPU, RAM, , , ; – , .

    , , ; , / K8s-





  • – , , ; , , .

    ( ) Β«-Β». – Β«/ Β», , ,





  • - – , -.

    – , , (-), , , . , , , –





  • - – , ; -,





Pull VS Push

– ?





Push- – , . ( , ), – , , - .





Pull- – , , . , . – , , , . , – . K8s, , , . – -.
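The difference between the two models can be sketched without any network code. This is only an illustration under stated assumptions: `PushCollector`, `PullCollector` and `App` are hypothetical classes, and a method call stands in for what would be an HTTP request or a UDP packet in a real setup.

```python
from typing import Callable, Dict, List, Optional, Tuple


class PushCollector:
    """Push model: the collector is passive, applications send samples to it."""

    def __init__(self) -> None:
        self.samples: List[Tuple[str, float]] = []

    def receive(self, name: str, value: float) -> None:
        self.samples.append((name, value))


class PullCollector:
    """Pull model: the collector keeps a list of targets and polls each one."""

    def __init__(self) -> None:
        self.targets: Dict[str, Callable[[], Dict[str, float]]] = {}
        self.samples: List[Tuple[str, float]] = []

    def register(self, target: str,
                 scrape: Callable[[], Dict[str, float]]) -> None:
        self.targets[target] = scrape

    def scrape_all(self) -> None:
        for target, scrape in self.targets.items():
            for name, value in scrape().items():
                self.samples.append((f"{target}:{name}", value))


class App:
    """A toy application that counts the requests it has handled."""

    def __init__(self) -> None:
        self.requests_total = 0

    def handle_request(self, push_to: Optional[PushCollector] = None) -> None:
        self.requests_total += 1
        if push_to is not None:
            # Push: the application decides when a sample is sent.
            push_to.receive("requests_total", float(self.requests_total))

    def metrics(self) -> Dict[str, float]:
        # Pull: the application only exposes its current state on request.
        return {"requests_total": float(self.requests_total)}


app = App()

push = PushCollector()
app.handle_request(push_to=push)
app.handle_request(push_to=push)

pull = PullCollector()
pull.register("app-1", app.metrics)
pull.scrape_all()

print(push.samples)
print(pull.samples)
```

In the push variant the collector sees every update but has to be known to each application; in the pull variant the collector has to know every target, and a failed scrape is itself a signal that the target is unavailable.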





– .





– TSDB (Time-Series DataBase), . , Β« – – Β» .





– VictoriaMetrics, .





, , :





  1. – , , , . () Β« /- Β».

    , - – Nginx`, Apache`, ; Β« Β» Β«- Β»,





  2. – , ; drilldown- . , , Β« ?Β».

    , Nginx , – , , . , -





  3. – .

    // . – Β« ?Β». , – nginx_01 proxy.local,





  4. – , , : - , , .

    , Nginx ; , , Β«/ /Β». , ,





  5. – .

    - «». – CPU, RAM, .., . , ; proxy.local,





, , :





The monitoring user moves from top to bottom, analyzing the incident
,

:





  • , . , , , , , ,





  • , – , Β«-Β» ..





  • . , – - ,





Grafana, , c , -.





, , . , – , , .





, , , . , :





  • – , .

    : Β« CPU 90% Β»; , , , -, , -, , .

    , , /// , – ( – , )





  • – /, ; , , uri - ..





  • – , , , ,





  • / , – , ,





, , , :





  • . // , - . , «» -





  • ; , , , , . ( )





  • / , ,





  • , . , , ,





  • – , . , Nginx ( ), , - Β« Β»





AlertManager – Prometheus, . Β« Β», . - API .





, , ; , .





This is the first of three planned texts. Next I would like to cover logging and its synergy with monitoring, and after that, perhaps, move on to some technical details (not only in dry text). If you would be interested in reading about this, please write in the comments. We will first try to take apart and discuss the general approach to collecting and centrally storing logs and their role in assessing the state of the monitored site, and also touch on the question: "is it possible to separate logs from metrics?"







