This is the story of how we found ourselves on the verge of bankruptcy before we even had time to launch our first product, how we managed to survive, and what lessons we learned.
In March 2020, when COVID hit the world, our startup Milkie Way was also hit hard and nearly shut down. We burned $72,000 during a few hours of internal research and testing of Cloud Run with Firebase.
I started developing the Announce service in November 2019. The main goal was to release a minimal, functional first version of the product, so we kept the code on a simple stack. We used JS and Python and deployed the product on Google App Engine.
With a very small team, we focused on coding, UI development, and getting the product ready. I spent almost no time managing the cloud: just enough to get the system up and running and to set up a basic development workflow (CI/CD).
Desktop Announce
The first version was not very polished, but we just wanted to release something to experiment with and then work on a proper version. Because of COVID, we thought it was a good time to launch, since government agencies around the world could use Announce to publish alerts.
Wouldn't it be nice to generate some data on the platform while users hadn't uploaded their own content yet? That thought led to another project, Announce-AI, for content generation: data-rich events such as earthquake alerts and possibly relevant local news.
Some technical details
We started developing Announce-AI with Cloud Functions. Since our scraping bot was still in its early stages, lightweight functions seemed like a good fit. But scaling was a problem, because Cloud Functions have a timeout of about 9 minutes.
Then we learned about Cloud Run, which at the time had a generous free tier! Without fully understanding it, I asked the team to deploy a "test" version of Announce-AI on Cloud Run and evaluate its performance. The goal was to play around with Cloud Run and gain experience.
Google Cloud Run
Since our site was very small, we used Firebase as the database for simplicity: Cloud Run has no storage of its own, and deploying a SQL server or another database felt like overkill for a test.
I created a new GCP project, ANC-AI Dev, set a cloud billing budget of $7, and kept the Firebase project on the free plan (Spark). The worst case we could imagine was exceeding the daily Firebase limit.
After a few tweaks, we prepared the code, made a few manual requests, and left it running.
The nightmare begins
On the day of testing everything went well, and we returned to developing Announce. The next day, in the late afternoon after work, I went to take a nap. When I woke up, I saw several emails from Google Cloud, all sent a few minutes apart.
First email: automatic upgrade of our Firebase project
Second email: budget exceeded
Fortunately, my card had a $100 limit. Because of that, the payment did not go through, and Google suspended our accounts.
Third email: card declined.
I jumped out of bed, logged into Google Cloud billing and saw a bill for about $5,000. In a panic, I started clicking around, not understanding what was happening. In the back of my mind I was already wondering how this could have happened, and how we would pay a $5,000 bill if it came to that.
The problem was that the bill kept growing every minute.
Five minutes later it showed $15,000; after 20 minutes, $25,000. I had no idea when the numbers would stop climbing. Maybe they would grow indefinitely?
Two hours later, the figure stopped at just under $72,000.
By this time the team and I were on a call. I was in complete shock and had absolutely no idea what to do next. We disabled billing and shut down all services.
Since all of our GCP projects were paid with the same card, all of our accounts and projects were suspended.
The nightmare continues
This happened on Friday night, March 27th, three days before we planned to launch the first version. Development now came to a halt, because Google had suspended all of our projects, which were tied to a single card. My morale was at rock bottom, and the future of the company seemed uncertain.
All our cloud projects were suspended and development was stopped.
Once my mind had resigned itself to the new reality, at midnight I decided to properly figure out what had happened. I started drafting a document with a detailed investigation of the incident... and called it "Chapter 11" [the chapter of US bankruptcy law on reorganization - translator's note].
The two colleagues who had taken part in the experiment also stayed up all night, researching and trying to understand what had happened.
The next morning, Saturday, March 28th, I called and emailed a dozen law firms to make an appointment or speak to a lawyer. All of them were unavailable, but I managed to get a response from one by email. Since the details of the incident are complicated even for engineers, explaining them to a lawyer in plain English was a challenge in itself.
For an early-stage, self-funded startup like ours, there was no way to come up with $72,000.
By this point I had thoroughly studied Chapters 7 and 11 of the bankruptcy code and mentally prepared myself for what might come next.
Some respite: GCP loopholes
On Saturday, after sending out the emails to lawyers, I started reading and going through every page of the GCP documentation. We had certainly made mistakes, but it made no sense that Google would let us spend $72,000 so suddenly when we had never made a single payment before!
GCP and Firebase
1. Automatic upgrade of a Firebase account to a paid account
We did not expect this, and there was no warning about it anywhere when signing up for Firebase. Our GCP billing account was connected for Cloud Run, but Firebase was on the free plan (Spark). GCP simply upgraded it to a paid plan out of nowhere and charged us whatever it considered due.
It turns out they call this process "deep integration between Firebase and GCP".
2. There are no billing "limits". Budgets lag by at least a day
GCP billing is effectively delayed by at least 24 hours. In most of its docs, Google suggests using budgets and the automatic cloud shutdown feature. But by the time the shutdown function is triggered, or a notification reaches the user, the damage has already been done.
It takes about a day for billing to sync, which is why we only noticed the bill the next day.
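For context, the "automatic shutdown" that the docs point to is usually a small Cloud Function subscribed to the budget's Pub/Sub topic, which detaches the billing account once reported spend exceeds the budget. A minimal Node.js sketch of that pattern is below; the project ID is a placeholder, the function's service account needs billing-admin permissions, and, as noted above, it can only react after the billing data has caught up.

```javascript
const {google} = require('googleapis');

// Hypothetical project ID used for illustration.
const PROJECT_NAME = 'projects/anc-ai-dev';

// Triggered by Pub/Sub messages published by a Cloud Billing budget.
exports.stopBilling = async (pubsubEvent) => {
  // Budget notifications arrive as base64-encoded JSON.
  const data = JSON.parse(Buffer.from(pubsubEvent.data, 'base64').toString());
  if (data.costAmount <= data.budgetAmount) {
    console.log('Spend is still within budget, nothing to do.');
    return;
  }

  // Detach the billing account from the project, which stops all paid usage.
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/cloud-billing'],
  });
  const billing = google.cloudbilling({version: 'v1', auth});
  await billing.projects.updateBillingInfo({
    name: PROJECT_NAME,
    requestBody: {billingAccountName: ''},
  });
  console.log(`Billing disabled for ${PROJECT_NAME}`);
};
```

Even with a kill switch like this in place, the reaction is only as fast as the billing data feeding the budget, which is exactly the problem described above.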
3. Google should have charged $100, not $72K!
Since no payments had ever been made on our account, GCP should have first charged the $100 from our payment settings and, if that payment failed, stopped the services. But that did not happen. I figured out the reason later, but that is not the user's fault either!
Our first bill was for about $5,000. The next one was for $72K.
The billing threshold for our account was $100.
4. Don't rely on your Firebase dashboard!
Not only billing: the Firebase dashboard also took more than 24 hours to update.
According to the Firebase Console documentation, the numbers in the dashboard may "slightly" differ from the billing reports.
In our case, they differed by 86,585,365.85%, or roughly 86 million percent. Even after the invoice arrived, the Firebase console was still showing 42,000 reads and writes for the month (well below the daily limit).
New day, new challenge
After six and a half years at Google, where I had written dozens of design documents, incident reports, and more, I started writing a document for Google describing the incident and adding the loopholes on Google's side to the report. The Google team would only be back at work in two days.
Correction: Some readers have suggested that I used my internal contacts at Google. In fact, I did not reach out to anyone and followed the path that any ordinary developer or company would follow. Like any other small developer, I spent countless hours in chats and consultations, drafting long emails and filing bug reports. In a follow-up article on incident reporting, I will share the documents I submitted to Google.
My last day at Google
In addition, we needed to understand our mistakes and work out a strategy for the product. Not everyone on the team knew about the incident, but it was fairly clear that we were in serious trouble.
At Google I had seen human errors costing millions of dollars, but Google's culture protects its employees (the engineers just have to write long reports afterwards). This time there was no Google behind me. Our own modest capital and our hard work were at stake.
The steadfast Himalayas tell us ...
It was the first time I had taken a blow like this. It could change the future of our company and of my life. This incident taught me several business lessons, including the most important one: how to take a hit.
At the time I had a team of seven engineers and interns, and it took Google about ten days to get back to us about the incident. In the meantime we had to resume development and find a way around the suspended accounts. Despite everything, we had to stay focused on the features and the product.
The poem "The Steadfast Himalayas Tell Us"
For some reason a poem from my childhood kept running through my head. It came from my favorite book, and I remembered it word for word even though I had last read it more than 15 years earlier.
What did we actually do?
As a very small team, we wanted to avoid spending on hardware for as long as possible. The problem with Cloud Functions (and Cloud Run) was the timeout.
A single instance would continuously scrape URLs from a page, but after 9 minutes it would hit the timeout.
After a casual discussion of the problem, I sketched rough code on the whiteboard in a couple of minutes. I now realize that the code had plenty of architectural flaws, but at the time we were aiming for fast iteration cycles so we could learn quickly and try new things.
The Announce-AI concept on Cloud Run
To get around the timeout, I suggested using POST requests (with a URL as the payload) to submit jobs to instances, and launching multiple instances in parallel instead of queuing pages for a single one. Since each Cloud Run instance would scrape only one page, it would never time out; all pages would be processed in parallel (good scaling); and the process would be highly cost-optimized, since Cloud Run usage is billed with millisecond-level granularity.
Cloud Run Scraper
If you look closely, the design is missing a few important details.
- It recurses endlessly and exponentially: the instances never know when to stop, because there is no termination condition.
- POST requests can carry the same URL. If a page links back to a previous page, the Cloud Run service gets stuck in infinite recursion, and worst of all, that recursion multiplies exponentially (the maximum number of instances was set to 1000!).
As you can imagine, this resulted in 1,000 instances querying and writing to the Firebase DB every few milliseconds. At one point we saw Firebase reads running at about 1 billion requests per minute!
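To make the flaw concrete, here is a rough reconstruction of the kind of handler we deployed, written as a minimal Express app for Cloud Run. The service URL, the Firestore collection name, and the naive link extraction are illustrative stand-ins, not our actual code:

```javascript
const express = require('express');
const fetch = require('node-fetch'); // v2, CommonJS
const admin = require('firebase-admin');

admin.initializeApp();
const db = admin.firestore();

// URL of this same Cloud Run service (placeholder, set via an env var).
const SERVICE_URL = process.env.SERVICE_URL;

const app = express();
app.use(express.json());

app.post('/scrape', async (req, res) => {
  const {url} = req.body;

  // Scrape one page and store it in Firestore.
  const html = await (await fetch(url)).text();
  await db.collection('pages').doc(encodeURIComponent(url)).set({html});

  // Extract outgoing links (very naive).
  const links = [...html.matchAll(/href="(https?:\/\/[^"]+)"/g)].map((m) => m[1]);

  // Fan out: every link becomes a new POST back to this same service.
  // Flaw #1: no depth limit and no "already seen" set, so crawling never stops.
  // Flaw #2: pages linking back to each other cause infinite recursion, and
  //          with max-instances=1000 that recursion multiplies exponentially.
  await Promise.all(
    links.map((link) =>
      fetch(`${SERVICE_URL}/scrape`, {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({url: link}),
      })
    )
  );

  res.sendStatus(200);
});

app.listen(process.env.PORT || 8080);
```

A single seed URL is enough to set this off: each page POSTs all of its links back to the same service, and with no "already visited" check and a generous instance limit, the fan-out only stops when something external stops it.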
GCP Month End Transaction Summary
116 billion reads and 33 million writes
The experimental version of our app on Cloud Run made 116 billion reads and 33 million writes to Firestore. Ouch!
Cost of the Firestore reads, at $0.06 per 100,000 document reads:
($0.06 / 100,000) × 116,000,000,000 = $69,600
16,000 Cloud Run hours
After the test, the logs stopped, so we assumed the requests had died; in reality they had gone into background processing. Since we did not delete the services (we were using Cloud Run for the first time and did not really understand it yet), several services kept running slowly.
Over 24 hours, these services, across 1,000 instances, ran for a total of 16,022 hours.
All our mistakes
Deploying a flawed algorithm to the cloud
This has already been covered above. We did find a new way of using serverless POST requests that I had not seen described anywhere on the internet, but we deployed it without properly working out the algorithm.
Deploying Cloud Run with default parameters
When we created the Cloud Run service, we kept the default values: a maximum of 1,000 instances and a concurrency of 80 requests. We did not realize that for a test program these values are effectively the worst-case scenario.
If we had set max-instances = 2, the cost would have been 500 times lower.
If we had set concurrency = 1, we probably would not even have noticed the bill.
Using Firebase without fully understanding it
Some things you only understand from experience. Firebase is not a language you can simply learn; it is a packaged platform, and its rules are set by one specific company: Google.
Also, when writing Node.js code, you need to think about background processes. If code slips into a background process, it is not easy for the developer to know that the service is still running. As we later learned, this was also the cause of most of our Cloud Functions timeouts.
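A deliberately trivial sketch of how this happens (our code was more involved): if a promise is not awaited before the response is sent, the work silently continues "in the background", where the runtime may throttle it or keep the instance busy without anything obvious in the logs. The slow write here is a hypothetical stand-in.

```javascript
const express = require('express');
const app = express();
app.use(express.json());

// Hypothetical stand-in for a slow Firestore write.
async function saveResults(data) {
  await new Promise((resolve) => setTimeout(resolve, 30000));
  console.log('finished writing', data);
}

app.post('/process', (req, res) => {
  // BUG: the promise is not awaited, so this write keeps running after the
  // response below has been sent, invisibly to whoever is watching the logs.
  saveResults(req.body);

  // FIX: make the handler async and `await saveResults(req.body);`
  // before responding.
  res.sendStatus(200);
});

app.listen(process.env.PORT || 8080);
```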
Moving fast and fixing things later is a bad idea in the cloud
The cloud in general is a double-edged sword: used correctly it can be very useful, but used incorrectly you have only yourself to blame.
If you counted the pages of the GCP documentation, you could publish several thick volumes. Understanding it all, including billing and how each feature is metered, takes a lot of time and a deep understanding of how cloud services work. No wonder some people are hired full-time just for this!
Firebase and Cloud Run are really powerful
At its peak, Firebase was handling about a billion reads per minute for us. That is an extremely powerful tool. We have been playing with Firebase for two or three months now and are still discovering new aspects of it, but until this incident I had no idea how powerful the system is.
The same goes for Cloud Run! With concurrency set to 60, max instances set to 1,000, and requests taking 400 ms each, Cloud Run can handle 9 million requests per minute:
60 concurrent requests × 1,000 instances × 2.5 requests per second (400 ms each) × 60 seconds = 9,000,000 requests per minute
For comparison, Google Search handles about 3.8 million queries per minute.
Use monitoring
Although Google Cloud Monitoring will not stop the billing, it does send timely alerts (with a 3-4 minute delay). Google Cloud's terminology is not easy to pick up at first, but if you invest the time, the dashboards, alerts, and metrics will make your life a little easier.
These metrics are only retained for 90 days, so ours are no longer available.
We survived
Phew, that was close
After reviewing our long incident report describing the situation from our side, and after various consultations, conversations, and internal discussions, Google waived our bill!
Thank you Google!
We had been thrown a lifeline, and we used the opportunity to finish developing the product, this time with much better planning, a better architecture, and a far safer implementation.
Google, my favorite tech company, is not just a great company to work at; it is also a great company to work with. Google's tools are very developer-friendly, have great documentation (for the most part), and are constantly evolving.
(Note: this is my personal opinion as an individual developer. Our company is not sponsored by or affiliated with Google in any way.)
What's next?
After this incident, we spent several months studying the cloud and our architecture. Within a few weeks my understanding had improved so much that I could estimate the cost of scraping "the whole internet" with Cloud Run using an improved algorithm.
The incident forced me to analyze our product's architecture in depth, and we abandoned the first version's design in order to build a scalable infrastructure.
In the second version of Announce we did not just build an MVP; we built a platform on which we could develop new products in rapid iterations and test them thoroughly in a safe environment.
The journey took a while... Announce launched at the end of November, about seven months after the originally planned release, but it is highly scalable, takes the best of what the cloud has to offer, and is highly optimized.
We also launched on all platforms, not just the web.
Moreover, we reused the platform to build our second product, Point Address. It shares the same scalability and solid architecture.