Post-Mortem on the Major Amazon Kinesis Disruption in US-EAST-1 (Nov 25)

Translator's note: last week, an outage of one of the AWS services disrupted the availability and correct functioning of a number of this major provider's cloud services. The official write-up, promptly published by the company's engineers, describes the details of the incident, its causes and, most importantly, the lessons learned from it. We present its translation below.



In this post, we would like to share the details of the service disruption that occurred in Northern Virginia (US-EAST-1) on November 25, 2020.



Amazon Kinesis allows you to collect, process, and analyze streaming data in real time. In addition to direct use by customers, it is used by a number of other AWS services, and those services were also affected by the outage. The trigger (though not the root cause) of this event was a relatively small addition of capacity to the service, which began at 2:44 AM PST and finished at 3:47 AM PST.



How Kinesis works



Kinesis uses a large number of back-end cell clusters ("cells") that process data streams. These are the workhorses of Kinesis: they are responsible for distribution, access, and scaling of the streams. Streams are spread across the back-end servers by the front-end servers using sharding; a back-end cluster "owns" many shards and provides consistent scaling and fault isolation. The front-end's job is small but important: it is responsible for authenticating, throttling, and routing requests to the correct stream shards on the back-end clusters.
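As a rough illustration of the routing described above, the sketch below models a front-end's shard map as a lookup from a record's partition key to a shard, and from that shard to the back-end cell that owns it. The class and function names (ShardMap, route, the cell names) are hypothetical; only the general idea of hashing a partition key into a shard's hash-key range follows the public Kinesis model.

```python
import bisect
import hashlib
from dataclasses import dataclass

# Hypothetical model of a front-end shard map; NOT the real Kinesis internals,
# only an illustration of "partition key -> shard -> back-end cell".

@dataclass
class Shard:
    shard_id: str
    start_hash: int      # inclusive lower bound of the shard's hash-key range
    backend_cell: str    # back-end cluster ("cell") that owns this shard

class ShardMap:
    def __init__(self, shards):
        # Keep shards sorted by the start of their hash-key range for binary search.
        self.shards = sorted(shards, key=lambda s: s.start_hash)
        self.starts = [s.start_hash for s in self.shards]

    def route(self, partition_key: str) -> Shard:
        # Kinesis hashes the partition key to pick a shard; MD5 is used here to mimic that.
        key_hash = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
        idx = bisect.bisect_right(self.starts, key_hash) - 1
        return self.shards[idx]

# Example: three shards spread across two back-end cells.
shard_map = ShardMap([
    Shard("shard-0", 0, "cell-A"),
    Shard("shard-1", 2**126, "cell-A"),
    Shard("shard-2", 2**127, "cell-B"),
])

shard = shard_map.route("customer-42")
print(f"route to {shard.shard_id} on {shard.backend_cell}")
```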



We were adding new capacity to the front-end fleet. Each front-end server builds a cache of data, the so-called shard map, containing information about membership and about which back-end cluster owns which shards. It obtains this information by calling the service that provides membership information and by retrieving configuration data from DynamoDB.



In addition, each server continuously processes messages from the other Kinesis front-end servers. To do this, each front-end machine creates an OS thread for every other front-end server. When new capacity is added, the servers already running in the front-end fleet learn about the new members and create the corresponding threads. It takes up to an hour for every existing front-end server to learn about the new machines.
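The thread-per-peer design described above means that the number of OS threads each front-end server maintains grows linearly with the size of the fleet. The toy sketch below (hypothetical names and a deliberately tiny fleet) shows the pattern: one long-lived thread per peer, so adding capacity raises the thread count on every existing server.

```python
import threading

# Hypothetical sketch of "one long-lived OS thread per peer front-end server".
# Real fleets number in the many thousands, so the per-server thread count
# grows linearly with fleet size and can collide with OS limits.

def peer_channel(peer_name: str, stop: threading.Event) -> None:
    # Stand-in for the long-lived channel that exchanges shard-map and
    # membership information with one peer.
    while not stop.wait(timeout=1.0):
        pass  # poll / exchange messages with `peer_name`

stop = threading.Event()
peers = [f"front-end-{i}" for i in range(50)]   # tiny fleet for illustration

threads = [threading.Thread(target=peer_channel, args=(p, stop), daemon=True)
           for p in peers]
for t in threads:
    t.start()

print(f"{len(peers)} peers -> {threading.active_count()} threads in this process")
stop.set()
```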



Diagnostics and problem solving



At 5:15 AM PST, the first error messages appeared when writing and retrieving Kinesis records. Our teams immediately started studying the logs. Suspicion fell at once on the new capacity, but some of the errors had nothing to do with the new machines and most likely would have persisted even if we had removed all of the new capacity.



Nevertheless, as a precaution we began removing the new capacity anyway, while trying to establish the cause of the other errors. Their wide variety slowed our diagnosis: we saw errors in every kind of call made by both existing and new members of the front-end fleet, which made it difficult to separate side effects from the root cause.



By 7:51 AM PST, we had narrowed the list of suspects down to just a few candidates and determined that any of the most likely causes would require a full restart of the front-end fleet. The Kinesis team knew very well that this process had to be slow and careful.



The reason is that populating the shard map competes with the processing of incoming requests for the front-end server's resources. Bringing front-end servers back online too quickly would create contention between these two processes and leave too few resources for handling incoming requests. The result is predictable: higher error rates and increased latencies. In addition, slow front-end servers can be perceived as unhealthy and removed from the list of available servers, which in turn further slows down the recovery.



All possible fixes involved changing the configuration of every front-end server and restarting it. While our prime candidate for the source of the trouble (an issue that appeared to be putting pressure on memory) looked promising, if we were wrong we risked doubling the recovery time, since we would have to apply another fix and restart everything again. To speed up the restarts, in parallel with the investigation we began changing the front-end server configuration so that servers could obtain data directly from the metadata store at boot time, rather than from neighboring front-end servers.
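A minimal sketch of that configuration change, under assumed names: when a hypothetical flag such as LOAD_SHARD_MAP_FROM_METADATA_STORE is set, the server builds its shard map straight from the metadata store described above instead of asking neighboring front-ends. The flag name and both loader functions are illustrative, not actual Kinesis configuration.

```python
import os

# Hypothetical bootstrap logic; the environment flag and function names are
# illustrative only.

def load_shard_map_from_metadata_store():
    # Stand-in for reading the configuration data directly (e.g. from DynamoDB).
    print("loading shard map directly from the metadata store")
    return {"shard-0": "cell-A", "shard-1": "cell-B"}

def load_shard_map_from_peers():
    # Stand-in for the normal path: learn the map from neighboring front-ends.
    print("loading shard map from neighboring front-end servers")
    return {"shard-0": "cell-A", "shard-1": "cell-B"}

def bootstrap_shard_map():
    # During recovery, reading from the metadata store avoids depending on peers
    # that are themselves still restarting and warming up.
    if os.environ.get("LOAD_SHARD_MAP_FROM_METADATA_STORE") == "1":
        return load_shard_map_from_metadata_store()
    return load_shard_map_from_peers()

shard_map = bootstrap_shard_map()
print(f"shard map ready with {len(shard_map)} shards")
```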



Root cause



At 9:39 AM PST, we were finally able to confirm the root cause of the failure. It turned out not to be related to memory at all. The addition of new capacity caused the number of threads on all front-end servers to exceed the maximum allowed by the operating system configuration. Because the limit was exceeded, the cache (the shard maps) could not be built, and as a result the front-end servers were unable to route requests to the back-end clusters.



We did not want to increase the OS thread limit without prior testing, and since the additional capacity had already been removed by that point, we judged the risk of hitting the system thread limit again to be minimal and proceeded with restarting the servers. The first group of fresh front-ends started accepting Kinesis traffic at 10:07 AM PST.
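The post does not name the specific limit that was hit, but on Linux the usual candidates can be inspected before changing anything. The sketch below is only an illustration of that kind of check: it reads the per-user process/thread limit (RLIMIT_NPROC, what `ulimit -u` reports) and the system-wide kernel.threads-max value.

```python
import resource
from pathlib import Path

# Illustrative only: the write-up does not say which exact limit was exceeded,
# so this just shows how to inspect common Linux thread limits.

# Per-user limit on processes/threads (what `ulimit -u` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft} hard={hard}")

# System-wide ceiling on the total number of threads.
threads_max = Path("/proc/sys/kernel/threads-max")
if threads_max.exists():
    print(f"kernel.threads-max = {threads_max.read_text().strip()}")

# Raising either value is a configuration change that deserves testing first,
# which is why the team chose to restart within the existing limits instead.
```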



The front-end fleet consists of many thousands of servers, and for the reasons described above we could add servers at a rate of no more than a few hundred per hour. We continued slowly adding traffic to the front-end, observing a steady decline in Kinesis service errors from midday onward. The service fully recovered at 10:23 PM PST.
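A toy sketch of that throttled bring-up, under assumed names: servers are returned to the fleet in small batches with a pause between batches so that shard-map warm-up never starves request handling. The batch size, pacing, and bring_online() helper are illustrative, not actual Kinesis tooling; 25 servers every five minutes works out to roughly the "few hundred per hour" mentioned above.

```python
import time

def bring_online(server: str) -> None:
    # Stand-in for restarting one front-end and letting it warm its shard map.
    print(f"restarting {server} and letting it warm its shard map")

def throttled_bringup(servers: list[str], batch_size: int = 25,
                      pause_seconds: float = 5 * 60) -> None:
    for i in range(0, len(servers), batch_size):
        for server in servers[i:i + batch_size]:
            bring_online(server)
        if i + batch_size < len(servers):
            time.sleep(pause_seconds)   # let the batch warm up before adding more

# Short pause here only so the example finishes quickly.
throttled_bringup([f"front-end-{n}" for n in range(100)], pause_seconds=1.0)
```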



What we have learned



We have learned several lessons from the Kinesis incident and plan to make a number of improvements in the near future.



  • In the short term, we will move to larger servers with more CPU and memory. This reduces the total number of front-end servers and therefore the number of threads each server must maintain, since that number is directly proportional to the size of the fleet, giving us significant headroom in thread count.

  • We are adding fine-grained monitoring and alarming of thread consumption in the service.
  • We are finishing testing an increase in the thread count limits in our operating system configuration. We believe this will give us significantly more threads per server and additional safety margin.
  • We are making a number of changes to radically improve the cold-start time of the front-end fleet, including moving the front-end server cache to a dedicated fleet. We will also move a few large AWS services (such as CloudWatch) to a separate, partitioned front-end fleet.
  • In the medium term, we will greatly accelerate the cellularization of the front-end fleet (the approach we already use for the back-end). Cellularization lets us isolate the effects of a failure within a service and keep its components (in this case, the shard-map cache) operating within a previously tested and well-understood range. This work was already under way for the Kinesis front-end but had not yet been completed. Besides keeping the front-end within a consistent and well-tested range of total thread consumption, cellularization will provide better protection against any future unknown scaling limits.




The outage also affected a number of services that use Kinesis.



Amazon Cognito uses Kinesis Data Streams to collect and analyze API access patterns. While this information is extremely useful for the operation of the Cognito service, its delivery is not guaranteed (it is best effort). Data is buffered locally, which allows the service to cope with delays or short periods of unavailability of Kinesis Data Streams.



Unfortunately, the prolonged unavailability of Kinesis Data Streams exposed a latent bug in the buffering code that caused the Cognito web servers to start blocking on the backlogged Kinesis Data Streams buffers. As a result, Cognito customers faced API errors and increased latencies for Cognito User Pools and Identity Pools, and external users were unable to authenticate or obtain temporary AWS credentials.



In the early stages of the outage, the Cognito team tried to mitigate the impact of the Kinesis errors by adding capacity and thereby increasing the service's ability to buffer calls to Kinesis. Initially this helped, but by 7:01 AM PST the error rate had increased significantly. In parallel, the team worked on reducing Cognito's dependency on Kinesis. At 10:15 AM this change was deployed and the error rate began to decline. By 12:15 PM the error rate had dropped significantly, and at 2:18 PM PST Cognito was operating normally. To prevent this issue from recurring, we have modified the Cognito web servers so that they can now tolerate Kinesis API errors without exhausting their buffers (which was what led to the user-facing issues).
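A minimal sketch of that kind of behavior, under assumed names (LocalBuffer, drain): records destined for Kinesis go into a bounded local queue, writes to that queue never block, the oldest records are dropped when it fills, and API errors are absorbed rather than propagated to the request path. This is an illustration of best-effort buffering, not the actual Cognito change.

```python
import queue

class LocalBuffer:
    """Bounded, best-effort buffer: never blocks the caller, drops oldest data when full."""

    def __init__(self, max_records: int = 10_000):
        self._q = queue.Queue(maxsize=max_records)

    def add(self, record: dict) -> None:
        try:
            self._q.put_nowait(record)
        except queue.Full:
            try:
                self._q.get_nowait()      # drop the oldest record
            except queue.Empty:
                pass
            try:
                self._q.put_nowait(record)
            except queue.Full:
                pass                      # still full; drop the new record instead

    def drain(self, send) -> None:
        # Background flush; errors from the downstream API never reach the request path.
        while True:
            try:
                record = self._q.get_nowait()
            except queue.Empty:
                return
            try:
                send(record)
            except Exception:
                self.add(record)          # keep it for a later attempt, never block
                return

buf = LocalBuffer(max_records=3)
for i in range(5):
    buf.add({"event": i})                 # events 0 and 1 are dropped, best effort
buf.drain(lambda r: print("sent", r))
```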



CloudWatch uses Kinesis Data Streams to process metrics and logs. Starting at 5:15 AM PST, CloudWatch experienced increasing errors and latencies for the PutMetricData and PutLogEvents APIs, and alarms transitioned to the INSUFFICIENT_DATA state. While some CloudWatch metrics continued to be processed during the outage, most of them were affected by the elevated errors and latencies.
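For clients publishing their own metrics, the usual mitigation for transient PutMetricData errors is retrying with backoff and buffering locally when retries are exhausted. The sketch below uses boto3; the namespace, metric name, and retry parameters are placeholders, not values from the incident.

```python
import time
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Illustrative client-side retry with exponential backoff around PutMetricData.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def put_metric_with_retry(name: str, value: float, attempts: int = 5) -> bool:
    delay = 1.0
    for _ in range(attempts):
        try:
            cloudwatch.put_metric_data(
                Namespace="ExampleApp",
                MetricData=[{"MetricName": name, "Value": value, "Unit": "Count"}],
            )
            return True
        except (ClientError, BotoCoreError):
            # CloudWatch is erroring or unreachable; back off and try again.
            time.sleep(delay)
            delay = min(delay * 2, 30.0)
    return False   # caller can buffer the datapoint locally and retry later

if not put_metric_with_retry("RequestsServed", 1.0):
    print("CloudWatch unavailable; datapoint buffered for a later flush")
```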



At 5:47 PM PST, the first signs of recovery appeared as the Kinesis Data Streams situation improved, and by 10:31 PM CloudWatch metrics and alarms had fully recovered. Processing of the delayed metrics and logs continued over the following hours. While CloudWatch was experiencing these errors, internal and external clients were unable to deliver metric data to CloudWatch, so there are gaps in the CloudWatch metric data for that period.



At the moment, the CloudWatch service relies on Kinesis to collect metrics and logs, but the CloudWatch team will soon deploy a change after which CloudWatch will retain data for three hours in a local metrics data store. This change will allow users and services that depend on CloudWatch metrics (including AutoScaling) to access newly collected metrics directly from that local CloudWatch data store. The change has already been rolled out in the US-EAST-1 region, and in the coming weeks we plan to deploy it globally.
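A toy version of such a time-bounded local store, under assumed names: recent datapoints are kept in memory and anything older than the retention window is evicted, so consumers can still read the freshest values even if the downstream pipeline is degraded. The three-hour window mirrors the change described above; everything else is an assumption for illustration.

```python
import time
from collections import deque

RETENTION_SECONDS = 3 * 60 * 60   # keep the most recent three hours of datapoints

class RecentMetricStore:
    def __init__(self):
        self._points = {}   # metric name -> deque of (timestamp, value)

    def put(self, name: str, value: float, now: float | None = None) -> None:
        now = time.time() if now is None else now
        series = self._points.setdefault(name, deque())
        series.append((now, value))
        self._evict(series, now)

    def latest(self, name: str):
        series = self._points.get(name)
        return series[-1] if series else None

    def _evict(self, series: deque, now: float) -> None:
        # Drop datapoints that have aged out of the retention window.
        while series and now - series[0][0] > RETENTION_SECONDS:
            series.popleft()

store = RecentMetricStore()
store.put("CPUUtilization", 42.0)
print(store.latest("CPUUtilization"))
```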



Two other services were affected by the problems with CloudWatch metrics:



  • First, AutoScaling policies based on CloudWatch metrics experienced delays until 5:47 PM PST, the point at which CloudWatch began to recover.
  • Second, Lambda was affected. Lambda function invocations require publishing metric data to CloudWatch as part of the invocation, and Lambda's metric agents are designed to buffer that data locally if CloudWatch is unavailable. Starting at 6:15 AM PST, the buffered metric data grew to the point that it caused memory contention on the underlying hosts used for Lambda function invocations. The result: increased error rates. At 10:36 AM PST, engineers mitigated the memory contention and the invocation error rates returned to normal.


CloudWatch Events and EventBridge experienced increased API errors and latencies in event processing starting at 5:15 AM PST. As Kinesis availability improved, EventBridge resumed delivering new events to their targets while working through the backlog of accumulated events.



Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS) use EventBridge in their internal workflows to manage customer clusters and tasks. This affected the provisioning of new clusters, delayed the scaling of existing ones, and affected the de-provisioning of tasks. By 4:15 PM PST, most of these issues had been resolved.



Customer notification



In addition to the service issues themselves, at the very beginning of the incident we experienced some delays in communicating service status information to customers.



We have two ways of communicating with customers during operational events:



  1. The Service Health Dashboard, a publicly available dashboard for communicating broadly-impacting operational issues;
  2. The Personal Health Dashboard, for notifying affected customers directly.


During events like this one, we usually post information to the Service Health Dashboard. However, in this case, at the very beginning of the outage we were unable to update the Service Health Dashboard, because the tool used to publish updates itself relies on Cognito, which was affected by the outage.



We have a fallback method for updating the Service Health Dashboard that has minimal service dependencies. It worked as intended, but we still experienced some delays publishing to the Service Health Dashboard at the start of the event, because this backup tool is far less automated and less familiar to the support operators.



To ensure timely delivery of updates to all affected customers, the support team used the Personal Health Dashboard to alert them to potential service issues. We also posted a global banner with up-to-date information on the Service Health Dashboard to make sure users were fully informed about the incident. Until the end of the outage, we continued to use a combination of the Service Health Dashboard (with summaries in the global banner and details on how specific services were doing) and the Personal Health Dashboard, where we kept customers affected by the service problems up to date. Based on this experience, we have added mandatory exercises with the fallback system for posting messages to the Service Health Dashboard to our regular support engineer training.



...



Finally, we would like to apologize for the negative impact this incident has had on our customers. We take pride in the high availability of Amazon Kinesis and are well aware of how important this and other AWS services are to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this incident and use it to further improve availability.


