How we are untangling a zoo of 5 hosting setups across data centers

It all started with a desktop PC in a Moscow State University dorm and ordinary shared hosting, which served our train schedule that many of you have seen. And with shuffling files around at night to stay within the hosting limits. Then the first servers appeared. In May they got roughly triple the usual traffic and immediately went down. More precisely, we don't know exactly how much traffic they got, because they went down right at triple the usual.







Looking back, I can say that every hosting decision since then has been a forced move. And only now, in our fifteenth year, can we configure the infrastructure the way we actually want.



Today we sit in 4 physically separate data centers connected by a dark-fiber ring, with 5 independent resource pools spread across them. And it so happens that if a meteorite hits one particular cross-connect, 3 of those pools drop off at once and the remaining two cannot carry the load. So we started a full rebalancing to put things in order.



First data center



At first there was no data center at all. There was an old desktop PC in a Moscow State University dorm. Then, almost immediately, shared hosting from Masterhost (they're still around, the devils). Traffic to the train-schedule site doubled every 4 weeks, so very soon we switched to a KVM VPS; that was around 2005. At some point we ran into traffic limits, because back then you had to keep a balance between inbound and outbound traffic. We had two hosting plans, and every night we copied a couple of hefty files from one to the other to keep the ratio right.



By March 2009 we were still on nothing but VPSes... A fine thing, so we decided to switch to colocation. We bought a couple of physical servers (one of them is the very machine from the wall, whose case we keep as a keepsake). We put them into the Fiord data center (they're still around too, the devils). Why? Because it was close to our office at the time, a friend recommended it, and we had to get set up quickly. Plus it was relatively cheap.



Load sharing between the servers was simple: each ran a backend, MySQL with master-slave replication, and the frontend lived on the same box as the replica. In other words, there was almost no separation by load type. Pretty soon those weren't enough either, so we bought a third.
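
For illustration, the read/write split at that point amounted to something like this sketch (the hosts, credentials, and the pymysql driver are assumptions for the example, not our actual code): writes go to the single master, reads go to the replica sitting on the same box as the frontend.

# Minimal sketch of a master/replica read-write split (illustrative only).
# Assumes classic MySQL master-slave replication and the pymysql driver.
import pymysql

MASTER = {"host": "db-master.local", "user": "app", "password": "secret", "database": "schedule"}
# Local replica on the same box as the frontend.
REPLICA = {"host": "127.0.0.1", "user": "app", "password": "secret", "database": "schedule"}

def run_write(sql, params=()):
    # All writes go to the single master.
    conn = pymysql.connect(**MASTER)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
        conn.commit()
    finally:
        conn.close()

def run_read(sql, params=()):
    # Reads are served from the replica next to the frontend,
    # accepting a little replication lag in exchange for locality.
    conn = pymysql.connect(**REPLICA)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()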



Around October 1, 2009, we realized that although we now had more servers, we would still go down over New Year's... Traffic forecasts showed demand would exceed our possible capacity by a wide margin, and we were hitting the limits of database performance. We had a month and a half to prepare before the traffic spike. That was the time of the first optimizations. We bought a couple of servers purely for the database, with an emphasis on fast 15k rpm disks (I don't remember the exact reason we didn't go with SSDs, but most likely they had a low write-endurance limit while costing a fortune). We separated the frontend, backend, and databases, tuned the nginx and MySQL settings, and went through the SQL queries to optimize them. We survived.
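
A large part of that query work is simply finding the heaviest statements. As a rough illustration of the approach (not our actual tooling), a sketch like this over the MySQL slow query log already shows where the time goes:

# Sketch: rank queries from a MySQL slow query log by total time (illustrative).
# Assumes the standard slow-log text format with "# Query_time:" header lines;
# only the first line of multi-line queries is captured, which is enough for a rough ranking.
import re
from collections import defaultdict

totals = defaultdict(lambda: [0.0, 0])  # normalized query shape -> [total_seconds, count]

query_time = None
with open("mysql-slow.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = re.match(r"# Query_time: ([\d.]+)", line)
        if m:
            query_time = float(m.group(1))
        elif query_time is not None and line.strip() and not line.startswith(("#", "SET timestamp", "use ")):
            # Normalize numeric literals so identical query shapes group together.
            shape = re.sub(r"\d+", "N", line.strip())[:200]
            totals[shape][0] += query_time
            totals[shape][1] += 1
            query_time = None

for shape, (seconds, count) in sorted(totals.items(), key=lambda kv: -kv[1][0])[:10]:
    print(f"{seconds:8.1f}s over {count:5d} calls  {shape}")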





Today we sit in a pair of Tier III data centers and one Tier II facility (which aspires to T3 but has no certificates). Fiord, however, was never even Tier II. They had survivability problems; there were situations along the lines of "all the power cables run through one duct, there's a fire in it, and the generator took three hours to arrive." In general, we decided to move.



We chose another data center, Caravan. The task: how to move the servers without downtime? We decided to live in two data centers for a while. Fortunately, internal traffic back then was nowhere near what it is now, so for a time it was feasible to push traffic between the sites over a VPN (especially off-season). We set up traffic balancing, gradually increased Caravan's share, and after a while moved there completely. So once again we had a single data center, and we already understood we needed two, thanks to the failures at Fiord. Looking back, I can say that Tier III is not a panacea either: survivability may be 99.95, but availability is a different matter. One data center is definitely not enough for availability of 99.95 and above.
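
The balancing itself is conceptually trivial; here is a toy sketch of the idea (the weights and site names are made up, and in reality the split lived in the balancer rather than in application code): nudge the Caravan share up a few points at a time until all traffic lands there.

# Sketch: gradually shifting traffic between two sites by weight (illustrative only).
import random

# During the migration these weights were nudged a few points per day.
WEIGHTS = {"fiord": 80, "caravan": 20}

def pick_site():
    # Weighted random choice of the serving site; session stickiness is handled elsewhere.
    sites, weights = zip(*WEIGHTS.items())
    return random.choices(sites, weights=weights, k=1)[0]

if __name__ == "__main__":
    sample = [pick_site() for _ in range(1000)]
    print({site: sample.count(site) for site in WEIGHTS})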



The second pick was Stordata, which already offered the possibility of an optical link to the Caravan site. We managed to pull the first fiber strand. We had barely started loading the new data center when Caravan announced they were in deep trouble: they had to vacate the site because the building was being demolished. Already. Surprise! There was a new site, and the proposal was to power everything down, lift the racks of equipment out with cranes (by then we had 2.5 racks of hardware), move them, switch them on, and it would all just work... 4 hours for the whole thing... fairy tales... Never mind that even an hour of downtime was unacceptable for us; this story would have dragged on for at least a day. And all of it was delivered in the spirit of "Everything's lost, the cast is coming off, the client is leaving!" The first call came on September 29, and by October 10 they wanted to pick everything up and haul it away. In 3 to 5 days we had to come up with a moving plan and then, in 3 stages, shutting down 1/3 of the equipment at a time while fully preserving service and uptime, transport the machines to Stordata. In the end the downtime was 15 minutes, in one not particularly critical service.



So again we were left with one data center.



By this point we were tired of carrying servers around under our arms and playing movers, and tired of dealing with the hardware itself in the data center. We started looking toward public clouds.



From 2 to 5 (almost) data centers



We started looking at cloud options. We went to Krok, tried it, tested it, agreed on terms, and ended up in their cloud hosted in the Compressor data center. We built a dark-fiber ring between Stordata, Compressor, and the office: each site with its own uplink and two fiber legs. Cutting any one leg does not break the network; losing an uplink does not break the network. We obtained LIR status, got our own subnet and BGP announcements, and the network is redundant. Beautiful. I won't describe how we entered the cloud from the network point of view, but there were nuances.



So now we had 2 data centers.



Krok also has a data center on Volochaevskaya Street; they expanded their cloud there too and offered to move part of our resources over. But remembering the story of Caravan, which in fact never recovered after its data center was demolished, we wanted to take cloud resources from different providers to reduce the risk of any single company ceasing to exist (this being a country where you can't ignore that risk). So we didn't get involved with Volochaevskaya at the time. Besides, a second vendor works magic on prices: when you can pack up and leave easily, you have strong leverage in price negotiations.



We looked at different options, but the choice fell on #CloudMTS. There were several reasons: the cloud performed well in tests, the team knows how to work with the network (they are a telecom operator, after all), and their very aggressive market-capture strategy translated into interesting prices.



That made 3 data centers in total.



After that we connected Volochaevskaya as well: we needed additional resources, and Compressor was already getting cramped. In short, we redistributed the load across the three clouds and our own equipment in Stordata.



4 data centers. And in terms of survivability they are all now T3. Not all of them seem to have certificates, but I won't say for sure.



MTS came with one nuance: the only provider that could do the last mile into their facility was MGTS. At the same time, it wasn't possible to get MGTS dark fiber all the way from data center to data center (it takes a long time, costs a lot, and, if I'm not mixing things up, they don't offer that service at all). We had to do it with a splice: run two legs out of the data center to the nearest manholes where our dark-fiber provider, Mastertel, is present. They have an extensive fiber network across the city, and when needed they simply splice up the route you want and give you a pair to live on. Meanwhile the World Cup came to town, unexpectedly, like snow in winter, and access to manholes in Moscow was closed off. We waited for that miracle to end so we could lay our link. It would seem all we had to do was walk out of the MTS data center with fiber in hand, stroll whistling to the right manhole, and lower it in. So to speak. It took us three and a half months. More precisely, the first leg went fairly quickly, by early August (recall that the World Cup ended on July 15). But we had to fiddle with the second leg: the first option would have meant digging up the Kashirskoye highway and closing it for a week (there was a tunnel under reconstruction with some utilities in it that had to be dug out). Fortunately, we found an alternative: a different route, just as geographically independent. So we ended up with two strands running from that data center to different points of our presence, and the fiber ring turned into a ring with a handle.



Jumping ahead a little, I'll say they took that link down anyway. Fortunately, it happened at the very start of operation, when little had been migrated. A fire broke out in one manhole, and while the installers were cursing in the foam, in the other manhole someone pulled out a connector to have a look at it (it was of some new design; who wouldn't be curious). Mathematically, the probability of a simultaneous failure was negligible. In practice, we hit it. The same kind of luck had struck us at Fiord: the main power went out there, and instead of switching it back on, somebody mixed up the breakers and cut the backup line.



The requirements for distributing load between locations were not only technical: there are no miracles, and aggressive pricing with good rates implies a certain committed growth in resource consumption. So we always kept in mind what share of resources had to go to MTS. Everything else we redistributed across the other data centers more or less evenly.



Our own hardware again



Our experience with public clouds showed that they are convenient when you need to add resources quickly, for experiments, for pilots, and so on. Under constant load, though, they end up more expensive than running your own hardware. At the same time, we could no longer give up the idea of containers, seamless migration of virtual machines within a cluster, and so on. We wrote automation to power off some of the machines at night, but the economics still didn't work out. We didn't have enough expertise to run a private cloud, so we had to grow it.
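
That night-time automation boiled down to something like the sketch below. The stop_vm()/start_vm() helpers and the tagging scheme are hypothetical placeholders for whatever cloud API is in use; the point is only the schedule logic.

# Sketch: powering off non-critical cloud VMs at night to trim the bill (illustrative).
# stop_vm()/start_vm() are hypothetical placeholders for the provider's real API client.
import datetime

NIGHT_START, NIGHT_END = 1, 6  # local hours during which tagged VMs may sleep

VMS = [
    {"name": "ci-runner-3", "tags": {"nightly-off"}},
    {"name": "mysql-master", "tags": set()},  # never touched: not tagged
]

def stop_vm(name):
    print(f"[dry-run] would stop {name}")   # replace with the real API call

def start_vm(name):
    print(f"[dry-run] would start {name}")  # replace with the real API call

def tick(now=None):
    # Intended to be run from cron every half hour or so.
    hour = (now or datetime.datetime.now()).hour
    night = NIGHT_START <= hour < NIGHT_END
    for vm in VMS:
        if "nightly-off" not in vm["tags"]:
            continue
        (stop_vm if night else start_vm)(vm["name"])

if __name__ == "__main__":
    tick()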



We were looking for a solution that would let us get a cloud on our own hardware relatively easily. We had never worked with Cisco servers at that point, only with their network stack, so that was a risk. Dell has simple, familiar hardware, as reliable as a Kalashnikov rifle; we've run it for years and still have some somewhere. But the point of HyperFlex is that hyperconvergence of the final solution is supported out of the box, while with Dell everything is assembled from ordinary general-purpose components, and there are nuances; in particular, real-world performance is not as impressive as in the presentations because of the overhead. I mean, it can be configured properly and it will be great, but we decided that was not our business; let Dell be tuned by those who feel it is their calling. In the end we chose Cisco HyperFlex. This option won overall as the most attractive: less hassle in setup and operation, and everything went fine during testing. In the summer of 2019 we put the cluster into production. We had a half-empty rack in Compressor, occupied mostly by network equipment alone, and that's where we placed it. Thus we got our fifth "data center": physically four, but five in terms of resource pools.



We sat down and calculated the volume of constant load versus variable load. The constant part we turned into load on our own hardware, but in such a way that at the hardware level it still gives cloud-style advantages in fault tolerance and redundancy.
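
The split itself is simple arithmetic along these lines (a sketch with made-up numbers, not our real figures): take the load history, call the lower end of it the constant part that belongs on our own hardware, and leave the variable tail to the clouds.

# Sketch: splitting load history into a constant baseline (own hardware)
# and a variable tail (clouds). The numbers are invented for illustration.
hourly_cpu_cores = [220, 210, 230, 250, 400, 380, 260, 240, 230, 520, 610, 300]

def split_load(samples, baseline_quantile=0.10):
    ordered = sorted(samples)
    idx = int(len(ordered) * baseline_quantile)
    baseline = ordered[idx]            # roughly the load carried around the clock
    peak = max(samples)
    return baseline, peak - baseline   # constant part, variable part

constant, variable = split_load(hourly_cpu_cores)
print(f"size own hardware for ~{constant} cores, burst up to {variable} more in the clouds")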



At the average prices we pay our clouds, the payback period of the hardware project is about a year.



You are here



At that point our forced moves came to an end. As you can see, economically we didn't have many options, and we kept putting load wherever we'd had to set up shop for one reason or another. That led to the odd situation where the load is uneven. Failure of any segment (and the segment with Krok's data centers hangs on two Nexus switches in a bottleneck) means a loss of user experience: the site stays up, but there are obvious accessibility problems.



There was one outage at MTS that took out the entire data center, and two more at the others. Periodically a cloud would drop off, or its cloud controllers would, or some tricky network problem would crop up. In short, we lose data centers from time to time. Briefly, yes, but still unpleasant. At some point we simply accepted as a given that data centers fall off.



We decided to go for data center-level fault tolerance.



Today we won't go down if one of the 5 data centers fails, but losing the Krok leg would mean very serious degradation. That's how the data-center fault-tolerance project was born. The goal is this: if a DC dies, the network leading to it dies, or its equipment dies, the site must keep working without any manual intervention. And after the incident, we must be able to recover through the normal, routine procedure.



What are the pitfalls



Now:





Need to:





Now:





Need to:





Elastic is resistant to the loss of one node:





MySQL databases (many small ones) are difficult to manage:







My colleague who did the balancing will write about this in more detail. What matters is that before we put this in place, if we lost the master we had to go to the standby by hand and set read_only=0 there, re-point all the replicas to the new master with Ansible (there are more than two dozen of them in the main chain), change the application configs, then roll the configs out and wait for the update. Now the application talks to a single anycast IP that points at an LVS balancer. The static config never changes. The entire database topology lives in the orchestrator.
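
To make the "before" concrete, the manual failover amounted to steps like the sketch below, run by hand against the standby and then against every replica in the chain (the hosts and credentials are placeholders, and GTID replication is assumed); this is exactly the work the orchestrator and the LVS balancer now do for us.

# Sketch of the old manual MySQL failover (illustrative placeholders for hosts/credentials).
# Assumes GTID replication; with binlog positions the re-pointing step is even messier.
import pymysql

NEW_MASTER = "db-standby.dc2.local"
REPLICAS = ["db-replica-01.dc1.local", "db-replica-02.dc3.local"]  # in reality, 20+ of them

def execute(host, statements):
    conn = pymysql.connect(host=host, user="admin", password="secret")
    try:
        with conn.cursor() as cur:
            for sql in statements:
                cur.execute(sql)
    finally:
        conn.close()

# 1. Promote the standby: stop replication and make it writable (the r/o = 0 flag).
execute(NEW_MASTER, ["STOP SLAVE", "RESET SLAVE ALL", "SET GLOBAL read_only = 0"])

# 2. Re-point every replica in the chain at the new master
#    (previously done across the whole fleet with Ansible).
for replica in REPLICAS:
    execute(replica, [
        "STOP SLAVE",
        f"CHANGE MASTER TO MASTER_HOST = '{NEW_MASTER}', MASTER_AUTO_POSITION = 1",
        "START SLAVE",
    ])

# 3. ...then change and roll out the application configs and wait for them to apply.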



Dark fiber is now stretched between our data centers, which lets us treat any resource inside the ring as local: the response time between data centers is roughly the same as within one. This is an important difference from other companies building geo-clusters: we are heavily tied to our own hardware and network, and we don't try to localize requests within a single data center. That's great on the one hand; on the other, if we ever want to go to Europe or China, we won't be able to drag our dark fiber along.



This means rebalancing almost everything, the databases first of all. There are many setups where the active master carries the entire load, both reads and writes, and next to it sits a synchronous replica for fast failover (we don't write to two at once, we replicate; otherwise it works poorly). The main database is in one data center and the replica is in another; there may also be partial copies in a third for individual applications. There are 10 to 15 such instances, depending on the season. The orchestrator itself is a cluster stretched across 3 data centers. We'll cover this in more detail when we find the strength to describe how all this music plays.



We'll also need to dig into the applications. This is needed even now: when a connection breaks, the right thing is often to close the old one and open a new one. But sometimes requests keep being retried in a loop on an already dead connection until the process dies. The latest case we caught was a cron job: train reminders weren't being sent out.
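
The cure for that class of bugs is the boring pattern sketched here (assuming a pymysql-style driver; the connection details are placeholders): when the connection breaks, close it and retry on a fresh one a bounded number of times instead of hammering the dead socket forever.

# Sketch: retry on a *fresh* connection instead of looping on a dead one (illustrative).
import time
import pymysql

DB = {"host": "10.0.0.10", "user": "app", "password": "secret", "database": "notify"}  # e.g. the anycast VIP

def query_with_reconnect(sql, params=(), attempts=3):
    last_error = None
    for attempt in range(attempts):
        conn = pymysql.connect(**DB, connect_timeout=3)
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except pymysql.err.OperationalError as exc:
            # Connection is broken: drop it, back off, and try again on a new one.
            last_error = exc
            time.sleep(2 ** attempt)
        finally:
            conn.close()
    raise last_error  # bounded retries: let the cron job fail loudly, not spin forever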



In general, there is still a lot to do, but the plan is clear.


