A fatal stack overflow: why we lost our DNS and how we plan to prevent it from happening again





Note: Bunny CDN (bunny.net) is a content delivery network and cloud hosting platform that runs its own DNS servers.



If anything matters to bunny.net more than performance, it is reliability. Everything is built around it: redundant monitoring, multi-layer auto-healing, three backup DNS networks, and a system that ties it all together to guarantee uptime.



But this time, none of it helped. On June 22, 2021, after nearly two years of flawless operation, a DNS outage took down almost all of our systems. 750,000 websites went offline for more than two hours. Within minutes, we lost over 60% of our traffic and hundreds of gigabits of bandwidth. Despite all the backup systems, a simple update to a single file caused a global failure.



We were extremely upset by this incident, but we hope to learn from it and build an even more robust platform. In the spirit of transparency, we want to share the details of what happened and how we plan to address similar issues in the future. Perhaps this story will be a lesson not only for us, but for others as well.



It all started with a simple update



I'll say it right away: the story is a mundane one. It all started with an ordinary, routine update. We are in the middle of a major upgrade to improve the reliability and performance of the entire platform. Part of that work is improving the performance of the SmartEdge routing system. SmartEdge processes a large amount of data that is periodically synchronized to our DNS nodes. To keep synchronization efficient, we use our own Edge Storage platform, which distributes the large database files around the world via Bunny CDN.



To reduce RAM consumption, bandwidth, and garbage collector load, we recently switched from JSON to the BinaryPack binary serialization library. It worked great for several weeks: lower memory usage, lower GC latency, lower CPU usage... until things broke.
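To make the switch concrete, here is a minimal sketch of the kind of change involved. This is not our production code: the OptimizationDatabase model and its contents are purely illustrative, and the BinaryConverter calls are based on BinaryPack's public API as we understand it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using BinaryPack;

// Purely illustrative model: the real optimization database is far larger.
public class OptimizationDatabase
{
    public Dictionary<string, string> Routes { get; set; } = new();
}

public static class SerializationDemo
{
    public static void Main()
    {
        var db = new OptimizationDatabase();
        db.Routes["example.b-cdn.net"] = "de-frankfurt";

        // Text-based JSON: simple and debuggable, but larger payloads and
        // more allocations, which means more GC pressure on the edge nodes.
        byte[] json = JsonSerializer.SerializeToUtf8Bytes(db);

        // BinaryPack: a compact binary layout that is cheaper to produce and parse.
        byte[] packed = BinaryConverter.Serialize(db);
        OptimizationDatabase restored =
            BinaryConverter.Deserialize<OptimizationDatabase>(packed);

        Console.WriteLine($"JSON: {json.Length} bytes, BinaryPack: {packed.Length} bytes");
    }
}
```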



On June 22nd at 8:25 UTC, we released an update designed to reduce the size of the optimization database. Unfortunately, it caused a corrupted file to be uploaded to Edge Storage. A damaged file by itself should not be a problem: the DNS software is designed to work with or without this data and to gracefully ignore any exceptions. Or so we thought.



It turned out that the corrupted file made the BinaryPack library crash immediately with a stack overflow, which bypasses any exception handling by simply terminating the process. Within a few minutes, our global network of hundreds of DNS servers was almost completely down.
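For those less familiar with the .NET runtime, this is the crucial detail: since .NET 2.0, a StackOverflowException cannot be caught by user code, so no try/catch around the deserializer could have saved the process. Here is a minimal sketch of the failure mode (the model and loading code are hypothetical, not our actual DNS software):

```csharp
using System;
using BinaryPack;

// Hypothetical stand-in for the real optimization database model.
public class OptimizationDatabase
{
    public string[] Nodes { get; set; } = Array.Empty<string>();
}

public static class EdgeDnsNode
{
    public static OptimizationDatabase? TryLoadDatabase(byte[] payload)
    {
        try
        {
            // A corrupted payload that drives the deserializer into unbounded
            // recursion overflows the stack right here - and the CLR terminates
            // the whole process, so the catch block below never gets a chance.
            return BinaryConverter.Deserialize<OptimizationDatabase>(payload);
        }
        catch (Exception)
        {
            // This only handles "ordinary" bad data; a StackOverflowException
            // is not catchable by user code on .NET.
            return null;
        }
    }
}
```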





Graph of DNS server availability (times in UTC+7)



Then it gets harder...



It took some time to understand what was happening. After ten minutes, we realized that the DNS servers could not be brought back to their normal state: they kept restarting and crashing.



At first it seemed that everything was under control. For exactly such cases, any deployment can be rolled back instantly with a single click. But it soon became clear that the situation was far more complicated than it looked. We quickly rolled back all the SmartEdge updates, but it was too late.



Both SmartEdge and our deployment systems deliver data to the physical DNS servers via Edge Storage and Bunny CDN, and we had just taken out the bulk of the global CDN.



Although the DNS servers can recover automatically, every time one tried to load the broken deployment, it simply crashed again. And as you can imagine, with the CDN down, the DNS servers could not reach it to download a fixed update. The circle was closed.



At 8:35 UTC (15:35 on the graph above), a few servers were still limping along, but it did not help much, and our bandwidth dropped to 100-150 Gbps. Our only luck was that it all happened in the early morning, at close to minimal traffic.





CDN traffic graph (times in UTC+7)



...and even harder



At 8:45, we came up with a plan: manually roll out an update that would disable SmartEdge on the DNS hosts. At first it seemed to be working. But we were very, very wrong. Because of the CDN failure, the DNS servers ended up downloading corrupted versions of the GeoDNS databases, and suddenly all requests were routed to Madrid. Since that is one of our smallest points of presence, it went down instantly.



To make matters worse, a hundred servers were now restarting in a loop, which took down the central API - and even the few servers we had managed to bring back to life stopped starting.



It took a while to figure out what was really going on, and after numerous attempts we gave up on the idea of rebuilding the network this way.



The situation had reached an impasse. The services desperately needed to be brought back up, but a single corrupted file had nearly killed the platform.



Bringing the situation back under control



Since all the internal distribution files were now corrupted and served via our own CDN, we had to find an alternative. At about 9:40, we had another idea: if all requests were going to end up in a single region anyway, let's at least direct them to the largest one as a temporary measure. A routing update redirected all requests to Frankfurt.



This was our first success, and a decent share of the traffic came back online. But it was no real solution: we had manually deployed the update to a handful of DNS servers, while the rest were still sending queries to Madrid. We had to act quickly.



It became clear that it was time to admit defeat. The only way out was to abandon our own systems entirely. The DevOps team got to work and painstakingly moved all the deployment systems and files to a third-party cloud hosting provider.



At 10:15, everything was ready. We reconfigured the deployment system and the DNS software for the new hosting and clicked the Deploy button. Traffic slowly but surely returned, and at 10:30 the system was back online. Or so it seemed.



Understandably, everyone was panicking and on edge. We were trying to get the system up and running as quickly as possible while simultaneously answering hundreds of support tickets and explaining the situation. Of course, in the rush, we made a bunch of typos and mistakes. Everyone knows how important it is to stay calm in situations like this, but that is easier said than done.



It turned out that in the rush we had installed the wrong version of the GeoDNS database, so the DNS clusters were still sending requests to Madrid. It was getting very frustrating, but it was time to calm down, recheck the settings, and carry out the final deployment carefully.



At 10:45, we did exactly that. Through the third-party cloud, we were able to synchronize the database, deploy the latest set of files, and bring the entire system up.



For half an hour, we closely watched traffic grow and meticulously checked the systems. Storage was pushed to its limits, because without SmartEdge we were serving a large amount of uncached data. Finally, at 11:00, things started to stabilize and bunny.net was back in business.



So what went wrong?



All of our systems were designed and tuned to work together, and they relied on each other to run critical parts of the internal infrastructure. When you build a lot of great infrastructure, you naturally want to use it for as many of your own products as possible.



Unfortunately, because of this approach, one damaged file brought down several layers of redundancy. First the DNS system failed, then the CDN, then storage, and finally the site optimization service.



In fact, while hundreds of servers were coming back online, the domino effect also knocked out the API and the dashboard, which in turn caused the logging service to crash.



Taking our lumps to get stronger!



Although this failure could have been avoided, we take it as a valuable lesson. No system is perfect, but you have to keep striving for the ideal, and the only way to get there is to learn from your mistakes.



First of all, we want to apologize to everyone affected and assure you that we are treating this issue with the utmost care. We have not had a large-scale system failure in several years, and we are determined to guarantee stable operation going forward.



The first and smallest step will be to phase out the BinaryPack library and test all third-party libraries more thoroughly in the future.



A more serious problem also became apparent: building infrastructure entirely within your own ecosystem can have dire consequences. We had seen the domino effect play out at Amazon and hoped it would never happen to us, but we miscalculated...



We are currently planning a complete migration of our internal APIs to a third-party service. That way, if their system fails, we only lose the update mechanism; and if our system fails, we will be able to respond quickly and reliably, avoiding the domino effect of the infrastructure collapsing brick by brick.



We also need to think about how to eliminate a single point of failure that spans multiple server clusters and is caused by a single file that is otherwise considered non-critical. We always try to roll out updates gradually using canary deployments, but this incident caught us by surprise: a non-critical part of the infrastructure became a critical single point of failure for several other clusters.



Finally, the DNS system itself needs to keep a local copy of all deployment files, with automatic failure detection. This adds another layer of redundancy: all systems become as independent of each other as possible, so that a failure in one does not trigger a domino effect.
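As a rough illustration of that idea (not a finished design; every path, file name, and command below is hypothetical), a DNS node could validate a freshly downloaded deployment in a separate process before promoting it, and otherwise keep serving from the last known-good local copy:

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class DeploymentGuard
{
    // Validate the downloaded database in a child process before swapping it
    // in. If the validator dies (stack overflow, corrupt data, OOM), only the
    // child process is lost and the node keeps serving its local known-good copy.
    public static bool PromoteIfValid(string downloadedPath, string livePath)
    {
        var psi = new ProcessStartInfo
        {
            // Hypothetical validator tool that simply deserializes the file
            // and exits with code 0 on success.
            FileName = "dotnet",
            Arguments = $"dbvalidator.dll \"{downloadedPath}\"",
            UseShellExecute = false
        };

        using Process? validator = Process.Start(psi);
        if (validator is null)
        {
            return false;
        }
        if (!validator.WaitForExit(30_000))
        {
            validator.Kill();
            return false;
        }
        if (validator.ExitCode != 0)
        {
            Console.Error.WriteLine("New database rejected; keeping the local known-good copy.");
            return false;
        }

        // Promote only after validation: copy next to the live file, then
        // replace it in a single move so readers never see a partial file.
        string tempPath = livePath + ".tmp";
        File.Copy(downloadedPath, tempPath, overwrite: true);
        File.Move(tempPath, livePath, overwrite: true);
        return true;
    }
}
```

With a scheme like this, a poisoned or unreachable download degrades only the freshness of a single node's data instead of taking the whole network down.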



I want to thank the support team, who worked tirelessly and kept everyone in the loop, and our users for their patience while we were fixing the problem.



We understand that this was a very stressful situation for our customers. We are doing our best to learn from it and, as a result, to become an even more reliable CDN than before.


