Online SRE intensive: we will break everything to the ground, fix it, break it a couple more times, and then rebuild it

Shall we break something? Otherwise it's all build and build, fix and fix. Deadly boring.



Let's break it in such a way that we don't get punished for it; better yet, that we get praised for the mess. And then we'll rebuild everything so that it is an order of magnitude better, more fault-tolerant, and faster.



And we'll break it again.



Do you think this is a contest in wielding the most secret tool in all of our space program, the Big Russian Space Hammer?



No, this is an online SRE intensive. It just so happens that no Slurm SRE course is ever like the previous one. That is simply because you can never guess what, in a huge complex system that thousands of users connect to every second and whose audience numbers several million, will fall off, break, slow down, glitch, or in hundreds of other ways ruin the mood of the on-duty shift of SRE engineers.



In December we will hold another SRE intensive.



Let's do a short retrospective. Just a few years ago, HR departments were racing to find more DevOps engineers for their companies. The prize has changed. Now, like the tracking system of a Pantsir-S1, they scan the surrounding area looking for SRE engineers. In the article "Eugene Varavva, a developer at Google. How to describe Google in 5 words", I described how an SRE engineer lives at Google, and how even a corporation like that faces a shortage of SRE specialists.



At the online Slurm SRE intensive in December, over three days, from 10:00 to 19:00, you will learn how to ensure the speed, fault tolerance, and availability of sites with limited resources, resolve IT incidents, and run debriefings so that problems do not recur.



Course speakers:



Ivan Kruglov. Staff Software Engineer at Databricks. Has enterprise experience in distributed message delivery and processing, big data and the web stack, search, building an internal cloud, and service mesh.



Pavel Selivanov. Senior DevOps Engineer at Mail.ru Cloud Solutions. Has dozens of built infrastructures and hundreds of written CI/CD pipelines to his name. Certified Kubernetes Administrator. Author of several courses on Kubernetes and DevOps. Regular speaker at Russian and international IT conferences.



Everything will be tough, unpredictable, and hands-on. You will build, break, and repair, sometimes in varying order.



Build: You will formulate SLO, SLI, and SLA indicators for a site consisting of several microservices; design an architecture and infrastructure that support them; build, test, and deploy the site; and configure monitoring and alerting.



Break: You will examine the internal and external factors that degrade SLOs: developer errors, infrastructure failures, influxes of visitors, and DoS attacks. You will learn about resilience, error budgets, testing practices, interrupt management, and operational load.



Fix: You will practice organizing the work of the emergency response team quickly and effectively: bringing in colleagues, notifying stakeholders, and setting priorities.



Study: You will review the approach to the site from an SRE perspective. You will analyze incidents and determine how to avoid them in the future: improve monitoring; change the architecture, the approaches to development and operation, and the procedures; and automate processes.
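To make the SLI, SLO, and error budget vocabulary from the steps above a little more concrete, here is a minimal Python sketch. It is not taken from the course materials: the `Window` class, the function names, the request counts, and the 99.9% target are all hypothetical, and it assumes a simple request-based availability SLI.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one service over an SLO window (hypothetical data)."""
    total_requests: int
    failed_requests: int


def availability_sli(w: Window) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if w.total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return (w.total_requests - w.failed_requests) / w.total_requests


def error_budget_remaining(w: Window, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * w.total_requests
    if allowed_failures == 0:
        return 1.0  # no traffic, so nothing has been spent
    return 1.0 - w.failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: one million requests, 400 of them failed.
    month = Window(total_requests=1_000_000, failed_requests=400)
    print(f"SLI: {availability_sli(month):.4%}")
    print(f"Budget left: {error_budget_remaining(month, slo_target=0.999):.1%}")
```

With a 99.9% target, this window tolerates roughly 1,000 failed requests, so 400 failures leave about 60% of the budget. A team could use a figure like this to decide whether to keep shipping changes or to pause and work on reliability.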



The online SRE intensive simulates real conditions: the time to restore the service will be extremely limited, just as in a real work situation.



You can find the terms of the SRE course, as well as the full program, here.



The online intensive is scheduled for December 2020. For those who pay for participation in advance, we have prepared a discount.



Are you ready for intensive training, tough challenges, and sudden outages?



It won't be easy. There will be professional growth.


