Cyber incident response: 5 rules for developing playbooks

The question of developing and maintaining incident response playbooks is now very actively discussed and has given rise to a whole range of approaches, and finding a balance among them is extremely important. Should the playbook be dead simple ("pull the handle, break the glass"), or should the operator think and make decisions based on his own expertise (although, as they said in one game of my childhood, "what is there to think about, you just have to pull")? It is hard to find the essence, let alone a silver bullet, behind the influx of trendy acronyms and systemic guidelines. Over the 8 years of operating our center for monitoring and responding to cyberattacks, we have made our share of mistakes and gained some experience in the matter, so we will try to share the rakes and pitfalls we met along the way, in 5 practical tips.







Synchronized Swimming Skills, or Aligning Monitoring and Response Tasks



As you know, the fastest animal in the world ought to be the centipede: it has the most legs. But the poor creature is seriously hampered by the need to synchronize all of them. A similar story often haunts the life of a SOC (Security Operations Center). Analysts develop detection scenarios, furrow their brows, and come up with new ways to detect attacks, while the response team often does not understand what to do with the resulting incidents (and the gap grows wider if an external commercial SOC stands on the first line of the barricades). This state of affairs usually leads to one of two extreme situations:



  • The first extreme: the SLA exists only "on paper". Monitoring detects an incident within minutes and the SOC sends its notification, but the actual response drags on for days, so the value of fast detection is lost and incidents quietly pile up in the queue.
  • The second is the opposite: "overheating". Every notification, regardless of criticality, triggers a full mobilization of the response team. People burn out on trifles, and when the SOC reports something genuinely serious, there are neither resources nor attention left for it.


Remember the wise Yin Fu Wo, who asked his student, as the latter was buying a security scanner, what he would then do with the vulnerabilities he found? Following his example, I really want to ask the response team a question: what exactly, and most importantly, how quickly will you do something with the incidents you find?



Thinking in terms of response capabilities allows you to "line up" the list of incidents by their internal criticality. For example, notifications about critical changes in processes or attacks on web applications can logically be switched to phone-call mode, with the immediate assembly of an investigation team. For the launch of TeamViewer on non-critical branch-office machines, a couple of hours of analysis is an adequate option. And it is quite acceptable to deliver a report on virus infection statistics once a day, over morning coffee, for a blanket closure of problems: mass disinfection, removal of prohibited software, OS updates, closing of vulnerabilities, and so on. This will substantially level the pace of monitoring and response work and create seamless, comfortable rules of the game across the entire incident management process.

Tip 1. Set your priorities. Since you will definitely not be able to deal with everything at once, determine which types of incidents are truly critical for your company's business, and fix the required time frame for their resolution.
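To make this tip concrete, here is a minimal sketch in Python of how such a mapping from incident type to response mode and deadline might look; all incident names and timings below are illustrative placeholders, not recommendations.

    # A minimal sketch (not a recommendation) of the idea behind Tip 1: each
    # incident type gets an explicit response mode and a fixed resolution
    # deadline. Incident names and timings are illustrative placeholders.
    from dataclasses import dataclass
    from datetime import timedelta

    @dataclass
    class ResponseRule:
        mode: str            # how the response team is engaged
        deadline: timedelta  # the time frame fixed for resolution

    PRIORITIES = {
        "web_app_attack":     ResponseRule("phone call, assemble investigation team", timedelta(minutes=15)),
        "rat_on_branch_host": ResponseRule("ticket, analyst parses within hours", timedelta(hours=2)),
        "virus_stats":        ResponseRule("daily digest, blanket cleanup", timedelta(days=1)),
    }

    def route(incident_type: str) -> ResponseRule:
        # Unknown types fall into the slowest lane instead of silently
        # disappearing from the queue.
        return PRIORITIES.get(incident_type, PRIORITIES["virus_stats"])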


Analytics, analytics, even more incident analytics



Many commercial SOCs, as well as the teams who write detection scenarios, often and quite seriously talk about an incredible percentage of filtered false positives, thanks to which customers are supposedly notified not of suspected incidents but only of actual attacks (as the poet put it, "In everything I want to reach the very essence").



From time to time, this gives rise to amazing processes for analyzing and investigating incidents, for example, the following:



  • Based on the analysis of network traffic, the SOC recorded the launch of a Remote Admin Tool (RAT) on a host.
  • The incident was handed over to the customer's security service, which could not verify it on its own: there was no monitoring on the host itself, and the RAT (or its traces) might already have been removed from the machine.
  • Therefore the machine was connected "after the fact": the host was hooked up to the SIEM so that its logs could be collected and analyzed to confirm the incident.


Let's leave aside the fact that, in the case of a real hacker attack, connecting the machine to log collection after the fact is, to put it mildly, not a very well-considered action. An attacker could simply have wiped all the relevant logs, and connecting to a freshly compromised machine under an account with considerable privileges can lead to far more serious consequences than the compromise itself (especially for those fearless customers who, tired of figuring out which checkboxes grant the correct rights, simply give the SIEM account domain administrator privileges).



The main thing is that, from the point of view of the result, this entire complex verification process by the SOC and the security service is simply equivalent to picking up the phone and asking the specific user whether he initiated the RAT session and for what reason. The answer can be obtained many times faster, and the total time spent investigating the incident shrinks significantly. Given that in roughly 98% of cases the launch of a RAT on a local machine is initiated by the user himself (which only makes the remaining 2% more meaningful), this approach to response is much more efficient.

Tip 2. Verify incidents by the shortest path. When designing a playbook, do not multiply analysis steps for the sake of looking thorough: if a suspicion can be closed with a single call to the user, start with the call, and save the "heavy" technical verification for the incidents that really warrant it.

Tell me where the host lives, and I'll tell you what to do with it



It is impossible not to touch here on a topic that often comes up in the development of monitoring and response processes: the inventory and accounting of assets. Most often, assets are discussed in the context of enriching incident information: to understand the significance of an incident, it is important to know what kind of network node it is, who owns it, and what software is installed on it. But from the standpoint of developing playbooks, this task takes on an additional meaning: the response process itself will directly depend on what kind of node is involved and in which part of the network it sits.



Consider a fairly basic incident: a host infected with a virus. It has completely different weight and criticality depending on where the host is located:



  • a machine in an isolated test segment: unpleasant, but hardly a reason to sound the alarm;
  • the laptop of a VIP user with access to the company's most sensitive information: a completely different level of criticality;
  • a host in a closed technological or payment segment, where such an infection should not be possible at all: a full-scale emergency.


Response processes in industrial companies require even more attention. The same incident with the launch of a RAT takes on completely different accents and criticality if the utility is launched where, by the logic of things, it cannot be: for example, on the workstation of a technological process operator. In this case, the default response measure is to disconnect and isolate the host, followed by finding out why the utility was run and, possibly, a detailed analysis of the host for signs of compromise by an external attacker.

Tip 3. Conduct an inventory of assets and superimpose the different classes of incidents on your network segments. Instead of a linear model, where the criticality of an incident is determined solely by its type, you will get a base matrix customized for your organization, one that can be improved and refined over time.
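A minimal sketch of what such a base matrix might look like in code; the segment names, incident classes, and criticality labels are hypothetical examples, not values from any real deployment.

    # A sketch of the "base matrix" from Tip 3: criticality depends on both
    # the incident class and the network segment, not on the type alone.
    # Segment and incident names are hypothetical examples.
    SEVERITY = {
        ("virus_infection", "test_segment"):       "low",
        ("virus_infection", "office_workstation"): "medium",
        ("virus_infection", "vip_laptop"):         "high",
        ("rat_launch",      "office_workstation"): "medium",
        ("rat_launch",      "ics_operator_ws"):    "critical",  # isolate the host first
    }

    def criticality(incident_class: str, segment: str) -> str:
        # Combinations not yet in the matrix default to "medium" until the
        # matrix is refined, so nothing is silently ignored.
        return SEVERITY.get((incident_class, segment), "medium")

    assert criticality("rat_launch", "ics_operator_ws") == "critical"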

Real response vs perfect response



The situation described above highlights the key question: how deeply is the internal team prepared to respond in order to eliminate the consequences and causes of an incident? Let's go back to the malware infection example; the response process may look like this:



  • Analysis of the malware penetration channel (mail / web / flash drive)
  • Obtaining information about the malware itself - which family, potential consequences, the presence of related utilities
  • Identifying indicators of compromise typical of the given malware and searching for them on neighboring machines; this is especially important when workstations and servers are not fully covered by anti-virus protection and the malware may have quietly settled on one of the uncovered hosts (a minimal sketch of such a sweep follows this list)
  • Search for all related utilities in the infrastructure and remediation
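As promised above, here is a minimal local sketch of the indicator sweep from step 3: walking a file tree and flagging files whose hashes match known indicators. In a real infrastructure this check would run on each neighboring host via an EDR agent or remote execution; the root path and the hash set below are placeholders.

    # A minimal local sketch of the indicator sweep (step 3 above): walk a
    # file tree and flag files whose SHA-256 is in a known-bad set.
    import hashlib
    from pathlib import Path

    KNOWN_BAD_SHA256 = {
        "0" * 64,  # placeholder; put real indicator hashes here
    }

    def sweep(root: str) -> list[Path]:
        hits = []
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            try:
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
            except OSError:
                continue  # unreadable files are skipped, not fatal
            if digest in KNOWN_BAD_SHA256:
                hits.append(path)
        return hits

    if __name__ == "__main__":
        for hit in sweep("."):
            print(f"IoC match: {hit}")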


But if this approach is applied to every viral body in the infrastructure and to every infection, it will result in very large labor costs. Therefore, a balanced approach to response is required, one that depends on several external parameters:



  • The already mentioned model of assets and their criticality
  • The behavior of the malicious family - worms, especially those carrying a potentially destructive load, require more attention
  • "Old age" of the virus and its awareness of anti-virus laboratories
  • Its belonging to the grouping toolkit relevant to the company or industry


Depending on all these parameters, a decision can be made: from the basic removal of malware from an ordinary machine, or the reimaging of a critical host, to a more complex response procedure with the involvement of specialized experts.
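One possible way to fold these parameters into a single triage decision, sketched in Python; the scoring weights and the resulting action labels are our illustrative assumptions, not a prescribed formula.

    # One way to combine the parameters above into a triage decision.
    # Weights and action labels are illustrative assumptions.
    def response_level(asset_critical: bool, worm_behavior: bool,
                       known_to_av_labs: bool, relevant_apt_toolkit: bool) -> str:
        score = 0
        score += 2 if asset_critical else 0        # asset model and criticality
        score += 2 if worm_behavior else 0         # destructive / self-propagating
        score += 1 if not known_to_av_labs else 0  # fresh sample, AV labs unaware
        score += 3 if relevant_apt_toolkit else 0  # toolkit of a relevant group
        if score >= 4:
            return "full investigation, involve specialized experts"
        if score >= 2:
            return "isolate the host, check for signs of compromise"
        return "basic malware removal or host reimaging"

    print(response_level(asset_critical=True, worm_behavior=False,
                         known_to_av_labs=True, relevant_apt_toolkit=False))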

Tip 4. Do not be lazy about "sharpening the ax." Additional conditions help refine the priority and the algorithm of actions in the course of responding to an incident. They allow you not only to carry out all the work needed to localize the incident and counter the attack, but also to avoid unnecessary movements in simpler cases.


Tell me who your friend is, and I'll tell you what to do



Finally, the depth of expertise on the response team's side certainly matters when developing playbooks. At the start of our work as a commercial SOC, all communication was built through a dedicated person in the customer's information security service. Even when this was a young specialist fresh out of university, he had a specialized education and, responding to various incidents time after time, accumulated his own expertise and did his job more and more efficiently.



Playbooks can be conditionally divided into two types: technical and business. The first describes the process flow when dealing with an incident and is written for a mature response team on the customer's side. The second describes the chain of departments involved in the incident, and its consumer is rather the line management. Accordingly, it is very important to "know your audience"; otherwise "difficulties of translation" arise, with all the attendant problems of understanding and interpretation.



Recently, customers have increasingly involved IT departments, business units, technologists, and even the helpdesk directly in the response process. And this often leads to incidents of its own. At the beginning of the pandemic, several customers were unexpectedly forced (like the whole country) to transfer their users to remote access en masse. Since a second authentication factor could not be implemented quickly, the following temporary scheme was agreed: the helpdesk verified each privileged remote connection with a phone call. If the user could not be reached, the question was escalated to the business owner of the system, who could decide to let the work continue or, in case of suspected unauthorized activity, to block the account until the circumstances were clarified. In the playbook for the helpdesk, we described the procedures for calling the user and finding the business owner's contacts in as much detail as possible. But we did not write down what the helpdesk employee should do upon receiving the command to lock the account (and the service had such rights). And the very first test run of the incident showed that, having received a message saying "illegitimate, block it", the helpdesk specialist simply closed the ticket without performing any blocking.

Tip 5. Keep it simple. It is extremely important to take into account the qualifications and turnover of people in the response team, and to decompose the playbook accordingly: from basic instructions with degrees of freedom for a specialist down to a step-by-step "alphabet" for external services.
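To illustrate what such a step-by-step "alphabet" might look like, here is a sketch of the remote-access playbook from the story above as explicit, owned steps, including the blocking step that was missed; the wording of each step is a hypothetical example.

    # A sketch of the remote-access playbook from the story above, decomposed
    # into an explicit step-by-step "alphabet" for an external service. Every
    # step names an owner and an action; step wording is hypothetical.
    REMOTE_ACCESS_PLAYBOOK = [
        {"step": 1, "owner": "helpdesk",       "action": "call the user and confirm the privileged session"},
        {"step": 2, "owner": "helpdesk",       "action": "if there is no answer, escalate to the system's business owner"},
        {"step": 3, "owner": "business_owner", "action": "decide: continue the work or block the account"},
        {"step": 4, "owner": "helpdesk",       "action": "on a 'block' decision: disable the account first, only then close the ticket"},
    ]

    for item in REMOTE_ACCESS_PLAYBOOK:
        print(f"{item['step']}. [{item['owner']}] {item['action']}")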
Developing a response process is a very creative undertaking for every company. Still, it is very useful to take into account both your own experience and that of others. And may NIST be with you.


