"SRE is not only about alerts and postmortems, but also about the fact that the code that wakes up at night does not reach production."





On May 21, an SRE intensive starts at Slurm. For three full days, participants will immerse themselves in the theory and practice of supporting high-load services. No work tasks, no family matters: just study. Under the cut, we tell you what awaits you if you decide to join.



About SRE



Site Reliability Engineering (SRE) is the discipline of ensuring the reliability of information systems: an approach to supporting sites, services, and applications with thousands or millions of users. The approach originated at Google and is now spreading around the world. In Russia, SRE has been adopted at Yandex, Mail.ru, Sberbank, Tinkoff, MTS, Megafon, and other companies.



SRE engineers typically grow out of experienced developers and system administrators: deep knowledge of server operating systems, networking, and monitoring tools matters, as do programming skills. On top of these hard skills sits the SRE methodology: specific practices that help ensure high reliability.



"SRE is not so much about alerts and postmortems. It's about making sure the code that wakes you up at night never reaches production."



From conversations with engineers who have implemented SRE


For a long time, the main source of knowledge about SRE was Google's book of the same name. Now there are several training programs in English and in Russian. One of them is the SRE intensive at Slurm.



Format of the intensive



The intensive takes place online and consists of lectures and practical sessions. There will be a Zoom broadcast and a Telegram chat with the speakers.



Two kinds of practice. Practical exercises come in two types: configuration by following a sample, and work on tasks whose solution is not predetermined. On the intensive, the latter are called cases.



Teamwork on a real service. To work on the cases, participants are grouped into teams of 5-8 people. Each team gets a stand with an application: several VDS instances hosting a ticket-ordering website.





The ticket-ordering service whose stable operation the participants of the intensive will ensure



Failure simulation. During the intensive, the site will suffer several major failures, and the team's task is to find the cause, fix it, and prevent a recurrence. The cases are based on real experience: the speakers collected the problems they faced in their own SRE practice and built an environment that simulates them.



Experienced speakers. The intensive program was developed and will be conducted by:



  • Ivan Kruglov, Staff Software Engineer at Databricks.
  • Artyom Artemiev, Lead SRE at TangoMe.
  • Pavel Selivanov, Senior DevOps Engineer at Mail.ru Cloud Solutions.


Support. Curators will help you form teams and organize joint work. The speakers and Slurm's technical support engineers will help you through the hard problems.



Remote format. Lectures are broadcast in Zoom; task discussion happens in Slack. All lecture recordings remain available after the intensive: it is useful to return to them later, in a calmer setting.



Three days of full immersion. The intensive runs for three full days, from 10:00 to 18:00 Moscow time, with short breaks between lectures and a lunch break.



Starts on May 21. There is still room.



Learn more and register



Below is the full intensive program.



Day 1: getting acquainted with the theory of SRE, setting up monitoring and alerting



On the first day, you will get acquainted with SRE theory, learn how to set up monitoring and alerting, and team up with the other participants.



We will talk about the SLO, SLI, and SLA metrics and how they relate to business requirements, share best practices for setting up monitoring and rules for the fire brigade, and hand out the first practical cases.



Topic 1: Monitoring



  • Why do you need monitoring,
  • Symptoms vs Causes,
  • Black-Box vs. White-Box Monitoring,
  • Golden Signals,
  • Percentiles (see the sketch after this list),
  • Alerting,
  • Observability.
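
To make the percentiles item concrete, here is a minimal Python sketch, with illustrative numbers only, of why dashboards track tail latency rather than the average: a handful of slow requests barely moves the mean but dominates the tail.

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# 980 fast requests (~50 ms) and 20 slow ones (3 s): a typical long tail.
latencies_ms = [random.gauss(50, 5) for _ in range(980)] + [3000] * 20

print(f"mean: {sum(latencies_ms) / len(latencies_ms):.0f} ms")  # ~109 ms, looks fine
print(f"p50:  {percentile(latencies_ms, 50):.0f} ms")           # ~50 ms
print(f"p99:  {percentile(latencies_ms, 99):.0f} ms")           # 3000 ms, reveals the tail
```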


Practice: Making a basic dashboard and setting up the necessary alerts.



Topic 2: SRE theory



  • SLO, SLI, SLA,
  • Durability,
  • Error budget (a worked example after this list).
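
To make the error-budget item concrete, here is a small worked example in Python. The 99.9% target and the 100 rps traffic figure are illustrative assumptions, not the course's numbers.

```python
# With a 99.9% availability SLO, the remaining 0.1% is your error budget:
# the unreliability you are allowed to "spend" on incidents and risky releases.
SLO = 0.999                       # availability target
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

budget_fraction = 1 - SLO
budget_minutes = MINUTES_PER_MONTH * budget_fraction
print(f"Error budget: {budget_minutes:.1f} minutes of downtime per month")  # 43.2

# The same budget expressed in failed requests, at an assumed 100 rps:
requests_per_month = 100 * 60 * MINUTES_PER_MONTH
failed_allowed = requests_per_month * budget_fraction
print(f"Or {failed_allowed:,.0f} failed requests out of {requests_per_month:,}")  # 259,200
```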


Practice: Adding SLO / SLI + alerts to the dashboard.

Practice: A first load test of the system.



Case 1: a downstream dependency. In a large system there are many interdependent services, and they do not always work equally well. It is especially frustrating when your service is fine but a neighboring one you depend on periodically goes down. The training project will find itself in exactly these conditions, and you will make it deliver the best quality it can anyway.
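
As a taste of what "the best quality it can" may mean in code, here is a minimal sketch. The fetch_seat_map helper and the ticket-service setting are hypothetical; the pattern of a bounded timeout plus a stale-cache fallback is one common way to survive a flaky downstream dependency.

```python
import time

_cache = {}  # last known-good responses, keyed by show id

def fetch_seat_map(show_id, call_downstream, timeout_s=0.3):
    """Return fresh data if the dependency answers in time, else the last good answer."""
    try:
        fresh = call_downstream(show_id, timeout=timeout_s)
        _cache[show_id] = (fresh, time.time())
        return fresh, "fresh"
    except TimeoutError:
        if show_id in _cache:
            stale, _ = _cache[show_id]
            return stale, "stale"  # degraded but usable answer for the user
        raise                      # nothing cached: propagate the failure
```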



Topic 3: Incident Management



  • Resilience Engineering,
  • How the fire brigade is organized,
  • How effective your team is during an incident,
  • 7 rules for the incident leader,
  • 5 rules for a firefighter,
  • HiPPO (highest paid person's opinion) and the communications leader.


Case 2: an upstream dependency. It's one thing when you depend on a service with a low SLO; it's another when your service plays that role for other parts of the system. This happens when evaluation criteria are not agreed upon: for example, you respond to a request within a second and count it a success, but the dependent service waits only 500 ms and then leaves with an error. In this case we will discuss the importance of reconciling metrics and learn to look at quality through the client's eyes.
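
The mismatch is easy to see in a toy calculation. The two deadlines below are the ones from the case description; the latency samples are illustrative.

```python
# The server counts any answer under 1 s as a success, while its client
# gives up after 500 ms, so the two sides disagree about the same requests.
server_deadline_s = 1.0   # what the service's own SLI counts as "fast enough"
client_deadline_s = 0.5   # how long the dependent service actually waits

observed_latencies_s = [0.2, 0.4, 0.6, 0.8, 0.95]  # all "successes" to the server

server_success = sum(t <= server_deadline_s for t in observed_latencies_s)
client_success = sum(t <= client_deadline_s for t in observed_latencies_s)

print(f"server-side success rate: {server_success / len(observed_latencies_s):.0%}")  # 100%
print(f"client-side success rate: {client_success / len(observed_latencies_s):.0%}")  # 40%
```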


Topic 4: onboarding a project into SRE

In large companies it is common to form a separate SRE team that takes over support of services from other departments. But not every service is ready to be supported. We will tell you what requirements a service must meet.



Day 2: solving problems with the environment and architecture



The second day is built almost entirely around two cases: problems with the environment (with a detailed look at health checking) and problems with architecture. The speakers will talk about working with postmortems and share templates you can use on your own team.



Topic 5: Health Checking



  • Health Check in Kubernetes,
  • Is our service alive?
  • Exec probes,
  • initialDelaySeconds,
  • Secondary Health Port,
  • Sidecar Health Server,
  • Headless Probe,
  • Hardware Probe.


Case 3: environment problems and a proper healthcheck. A healthcheck's job is to detect a dead service and cut traffic to it so that users never hit the problem. If you think it is enough to send the service a request and get an answer, you are mistaken: even a responding service is not guaranteed to work, because the problem may be in its environment. Through this case you will learn to configure a proper healthcheck and keep traffic away from where it cannot be processed.
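
One common shape of the fix the case points toward is a health endpoint that actually probes the environment. Below is a minimal stdlib-only Python sketch; check_database and check_disk are hypothetical stand-ins for real dependency checks.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    """Stand-in for a real dependency check, e.g. running SELECT 1 against the DB."""
    return True

def check_disk():
    """Stand-in for an environment check, e.g. free space on the data volume."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        healthy = check_database() and check_disk()
        # 200 lets the balancer route traffic to us; 503 drains it away.
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(b"ok" if healthy else b"unhealthy")

if __name__ == "__main__":
    # Served on a secondary health port, separate from application traffic.
    HTTPServer(("0.0.0.0", 8081), HealthHandler).serve_forever()
```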


Topic 6: Working with postmortems in practice: we write a postmortem for the previous case and analyze it with the speakers.



Topic 7: Solving Infrastructure Problems



  • Monitoring MySQL,
  • SLO / SLI for MySQL,
  • Anomaly detection.


Case 4: problems with the database. The database can also be a source of problems. For example, if you do not monitor replication lag, a replica will fall behind and the application will return stale data. Such cases are especially hard to debug: one moment the data is inconsistent, a few seconds later it isn't, and the cause is anyone's guess. Through this case you will feel all the pain of that debugging and learn how to prevent such problems.
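
Here is a sketch of the kind of check that catches this before users do: poll the replica's lag and alert on a threshold. It assumes the PyMySQL driver and a hypothetical monitoring host and user; as the comments note, the exact SQL statement and field name differ across MySQL versions.

```python
import pymysql

LAG_THRESHOLD_S = 5

def replication_lag_seconds(conn):
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        # "SHOW REPLICA STATUS" on MySQL 8.0.22+; older servers use "SHOW SLAVE STATUS".
        cur.execute("SHOW REPLICA STATUS")
        status = cur.fetchone()
    if status is None:
        return None  # not configured as a replica: that is also worth an alert
    # Seconds_Behind_Source on newer MySQL, Seconds_Behind_Master before;
    # the value is None when replication is broken.
    return status.get("Seconds_Behind_Source", status.get("Seconds_Behind_Master"))

conn = pymysql.connect(host="replica.db.internal", user="monitor", password="...")
lag = replication_lag_seconds(conn)
if lag is None or lag > LAG_THRESHOLD_S:
    print(f"ALERT: replica lag is {lag}; the app may be serving stale data")
```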


Day 3: traffic shielding and canary releases



Day 3 brings two cases about keeping production highly available: traffic shielding and canary deployment. You will learn what these approaches are and how to apply them. We are not planning any hardcore manual configuration, although who knows.



Topic 8: Traffic shielding



  • how request-rate and business-operation graphs behave as load grows,
  • saturation and capacity planning,
  • traffic shielding with rate limiting,
  • a rate-limiting sidecar.


Case 5: traffic shielding. What do you do when a service built to handle 100 requests per second starts receiving 1000? In this case you will shield the service with rate limiting so that it keeps serving the load it can handle instead of collapsing under the excess.
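
Rate limiting of this kind is often implemented as a token bucket. Here is a minimal sketch, sized with the case's illustrative 100 rps figure; the handle function and the 429 response are assumptions for illustration, not the course's exact setup.

```python
import time

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate          # tokens added per second
        self.capacity = burst     # maximum burst size
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False              # shed this request instead of saturating the backend

limiter = TokenBucket(rate=100, burst=20)  # sized for the 100 rps the service can take

def handle(request):
    if not limiter.allow():
        return 429                # Too Many Requests: fail fast, protect the service
    return 200                    # hypothetical: do the real work here
```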



Topic 9: Canary Deployment



  • deployments in k8s (Rolling Update vs Recreate),
  • canary and blue-green strategies,
  • a blue-green/canary release in k8s,
  • a canary release with GitLab CI/CD,
  • canary release,
  • writing the .gitlab-ci.yml.


Case 6: canary deployment. Even a release that passed all the tests can misbehave on production traffic, and it is better if only a small share of users ever sees it. In this case you will set up Canary Deployment and practice rolling out a new version gradually and safely.
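
The logic behind a canary rollout fits in a short control loop. The sketch below is a hand-written illustration only (the intensive automates this through k8s and GitLab CI/CD); set_canary_weight and error_rate are hypothetical hooks into your load balancer and monitoring.

```python
import time

STEPS = [1, 5, 25, 50, 100]   # percent of traffic on the new version
MAX_ERROR_RATE = 0.01         # roll back if the canary exceeds 1% errors

def canary_rollout(set_canary_weight, error_rate, soak_seconds=300):
    for percent in STEPS:
        set_canary_weight(percent)          # e.g. adjust Ingress/service weights
        time.sleep(soak_seconds)            # let real traffic hit the canary
        if error_rate() > MAX_ERROR_RATE:   # e.g. a monitoring query over the canary
            set_canary_weight(0)            # roll back: all traffic to the old version
            return False
    return True                             # the new version now serves 100% of traffic
```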




