MySQL: Execution Cannot Be Pardoned





The Eldorado website and online store handles about 40 thousand purchases every day. There is probably no need to explain what that means for the company's business.



Historically, the store runs on the Bitrix engine with a huge amount of custom code and add-ons. The storage is a MySQL cluster with four master servers.



A significant number of companies have monolithic applications, and many have to work with them. There are plenty of ways to deal with a monolith, but, unfortunately, few people write about the successful ones. I hope the story of how we prop up our monolith (while we saw it apart) will be of interest to you.



We are well aware that a massive architecture can bring a lot of problems. But destroying it with a single strong-willed decision is impossible: sales are ongoing, the site must keep working, and changes must not be radical for users. So the transition from a monolith to a set of microservices takes time, and during that time we need to hold out: keep the system operational and resilient under load.



What was the problem



For a very long time, the database cluster on the eldorado.ru website was built according to the following scheme:





All masters in this scheme work simultaneously, all in active mode, sequentially relaying the replication stream: ... M1 -> M2 -> M3 -> M4 -> M1 -> M2 -> M3 -> M4 -> M1 -> M2 -> ...



In our setup this configuration had exactly one advantage: it kept the system up and able to carry its load. The reason is that after any write, the application's query balancer routes the entire read stream to that master to guarantee consistency, and at some point a single master was no longer enough to handle all the reads.



But such a scheme could provide neither reliability nor speed. Although it looked simple, it had a number of flaws. It was very slow to propagate data updates across the cluster: in the worst case there were up to five replication hops (depending on which master the change was originally made on). These delays caused many problems both in the operation of the site and when placing orders in the online store.



Cons of this scheme:



  • The slaves in the zone farthest from the master where a change originated receive that update, in the worst case, only after roughly four times the transaction execution time; at times the replication lag was enormous;
  • Any failure on any of the masters leads to data inconsistency across the whole cluster until it is fixed;
  • Real replication lag is hard to judge: Seconds_Behind_Master in SHOW SLAVE STATUS only reflects the delay relative to the immediate upstream server, not the cumulative delay around the ring.


You can emulate this behavior in your test environments by enabling an artificial replication delay of 1-2 seconds on the slaves (which we did). It is a great way to check whether your application is ready for this kind of distributed architecture, and it is done via the MASTER_DELAY = N option.
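
For reference, a minimal sketch of how such a delay can be switched on and off on a test slave (MySQL 5.6 or newer; the 2-second value is just an example):

STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 2;   -- artificial lag, in seconds
START SLAVE;

-- and to remove it again
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 0;
START SLAVE;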



And finally, moving to another database is not an option in our case: the system is too large-scale, and too much in it is tied to MySQL-specific features and even to the behavior of its internal query optimizer.



How we solved it



We did not want to change the scheme on our own (it is a risky operation), so we started by looking for a consulting company that could propose and deploy a new architecture, and do it so that the site stays available and the switchover goes unnoticed. Among the candidates were the largest integrators and software developers.



Some companies simply did not answer us (which is normal), while others wrote that they were not ready to take on such a task. The question of the project's possible cost never even came up.

If the big integrators do not want to get involved in a task, it becomes doubly interesting to solve it ourselves. We had good hardware at our disposal, stability and fault-tolerance issues were a second priority, and at first we simply wanted to speed up data propagation somehow. Several features introduced in MySQL 5.6 and 5.7 looked well suited for that.



True, the documentation transparently hinted that we could not simply enable them, because in a ring every node is both a master and a slave, and during a rolling upgrade some slave will inevitably be running an older version than its master. Here is what it says:

The 5.7 master is able to read the old binary logs written prior to the upgrade and to send them to the 5.7 slaves. The slaves recognize the old format and handle it properly.



Binary logs created by the master subsequent to the upgrade are in 5.7 format. These too are recognized by the 5.7 slaves.



In other words, when upgrading to MySQL 5.7, the slaves must be MySQL 5.7 before you can upgrade the master to 5.7.


For testing, it was enough to bring up a test ring, for example via mysqld_multi, and run typical queries against it (you can even run all four instances on one host, on different ports and with different offsets), something like this:



mysql -h127.0.0.1 -P 3302 -e "RESET SLAVE; RESET MASTER; SHOW MASTER STATUS\G;"
mysql -h127.0.0.1 -P 3302 -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3301, MASTER_USER='root', MASTER_PASSWORD='', MASTER_LOG_FILE='master1-binlog.000001', MASTER_LOG_POS = 107;START SLAVE;"
mysql -h127.0.0.1 -P 3303 -e "RESET SLAVE; RESET MASTER; SHOW MASTER STATUS\G;"
mysql -h127.0.0.1 -P 3303 -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3302, MASTER_USER='root', MASTER_PASSWORD='', MASTER_LOG_FILE='master2-binlog.000001', MASTER_LOG_POS = 107;START SLAVE;"
mysql -h127.0.0.1 -P 3304 -e "RESET SLAVE; RESET MASTER; SHOW MASTER STATUS\G;"
mysql -h127.0.0.1 -P 3304 -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3303, MASTER_USER='root', MASTER_PASSWORD='', MASTER_LOG_FILE='master3-binlog.000001', MASTER_LOG_POS = 107;START SLAVE;"
mysql -h127.0.0.1 -P 3301 -e "RESET SLAVE; RESET MASTER; SHOW MASTER STATUS\G;"
mysql -h127.0.0.1 -P 3301 -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3304, MASTER_USER='root', MASTER_PASSWORD='', MASTER_LOG_FILE='master4-binlog.000001', MASTER_LOG_POS = 107;START SLAVE;"
      
      





After that, you can change the version of any instance by starting its config (same port, same datadir) with a different binary placed next to the original one, and then running mysql_upgrade to bring the data and system tables up to date.
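
Roughly like this, assuming the newer binaries are simply unpacked into /opt/mysql-5.6 (a hypothetical path) and we are upgrading the instance on port 3302:

mysqld_multi stop 2
/opt/mysql-5.6/bin/mysqld_safe --no-defaults --port=3302 \
    --datadir=/var/lib/mysql-ring2 --socket=/var/lib/mysql-ring2/mysql.sock \
    --server-id=2 --log-bin=master2-binlog &
/opt/mysql-5.6/bin/mysql_upgrade --host=127.0.0.1 --port=3302 --user=root   # fix up the system tables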



In production we could, for a limited time, send all traffic to a single master, upgrade another one in the meantime, then switch to it and upgrade the rest in turn. But for that we had to guarantee binlog format compatibility between the versions, so that absolutely no transactions would be lost along the way.



A little more documentation useful for our case:

To avoid incompatibilities, set the following variables on the MySQL 5.6 master:



binlog_checksum=NONE
binlog_row_image=FULL
binlog_rows_query_log_events=OFF
log_bin_use_v1_row_events=1 (NDB Cluster only) 
      
      





The last flag turned out to be mandatory for us even though we do not use NDB at all. Without it, replication between the 5.6 and 5.5 servers broke, and mysqlbinlog could not read the log, failing with ERROR: Error in Log_event::read_log_event(): 'Sanity check failed', data_len: 8203, event_type: 30. With the flag enabled, everything starts and works.
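
A quick sanity check here (not from our original toolchain, just the obvious one) is to make sure the older mysqlbinlog can still parse what the upgraded master writes:

mysqlbinlog master2-binlog.000001 > /dev/null && echo "binlog parsed OK"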

We did not enable GTID: besides the requirement to stay compatible with all of our old tooling, we objectively do not see enough advantages to justify the switch.



gtid_mode=OFF
      
      





The simplest test of replication correctness is to load a dump onto the 5.5 server and then onto the 5.6 server, in turn, and see whether everything replicates cleanly.
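
Something along these lines, assuming the test ring above with a 5.5 instance on port 3301 and a 5.6 instance on port 3302 (host and file names are illustrative):

# take a dump of the production schema and push it through the ring
mysqldump --single-transaction -h prod-slave.example eldorado2 > dump.sql
mysql -h127.0.0.1 -P3301 -e "CREATE DATABASE IF NOT EXISTS eldorado2"
mysql -h127.0.0.1 -P3301 eldorado2 < dump.sql
# then check every other node in the ring
mysql -h127.0.0.1 -P3302 -e "SHOW SLAVE STATUS\G" | grep -E "Running|Last_.*Error|Seconds_Behind_Master"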



Unfortunately, though not unexpectedly, the tests failed.



Last_Error: Column 18 of table 'eldorado2.b_adv_banner' cannot be converted from type '<unknown type>' to type 'datetime'
      
      





DATETIME in 5.6 is special: it gained fractional-second (microsecond) support, so 5.6 uses a new internal DATETIME format that 5.5 knows nothing about.



So version 5.6 can run in the ring alongside 5.5, but only as long as no columns of the new temporal types are created in any table that goes through replication (DATETIME in 5.6 != DATETIME in 5.5, and the same applies to TIME and TIMESTAMP; our database has more than 240 such columns).
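
Counting such columns is easy with a query against information_schema (the schema name here is taken from the error above):

SELECT COUNT(*)
  FROM information_schema.COLUMNS
 WHERE TABLE_SCHEMA = 'eldorado2'
   AND DATA_TYPE IN ('datetime', 'time', 'timestamp');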



We could not completely rule out DDL touching such columns, and we did not want to put the health of the whole cluster at risk. But we had a safer Plan B.



It required extra hardware for maneuvering: bringing up a complete copy of the cluster alongside the old one. Fortunately, we had that hardware, and since the opportunity was there, it made sense to build a "normal" cluster right away.



At the same time, we had to keep all of our existing monitoring, debugging, and binlog analysis tools working, and eliminate as many of the current architecture's shortcomings as possible.



Multichannel replication



We needed a silver bullet that would keep both the sites intact and the admins fed: multi-source ("multichannel") replication. We did not trust the new feature a priori and were not sure about the technology; we could not find any information or case studies on it anywhere, and there is little public experience of running it in large production systems.



So we thought everything through ourselves; the plan was roughly as follows:

  • build a new cluster next to the old one: 4 masters on 5.7, with every slave receiving all four streams via multi-source replication;
  • keep it in sync with the old cluster through replication;
  • switch the application over to the new masters and, once connections to the old cluster die off, decommission it.










In the target scheme, the 4 masters become 4 independent write streams for each slave, each applied through its own replication channel.
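
To give an idea of what this looks like in practice, here is a minimal sketch of attaching a 5.7 slave to all four masters over named channels (host names, credentials and binlog coordinates are made up; named channels also require master_info_repository=TABLE and relay_log_info_repository=TABLE on the slave):

CHANGE MASTER TO MASTER_HOST='m1.db.local', MASTER_USER='repl', MASTER_PASSWORD='repl_password',
       MASTER_LOG_FILE='master1-binlog.000001', MASTER_LOG_POS=4
       FOR CHANNEL 'm1';
-- ... repeat for channels 'm2', 'm3', 'm4' with the corresponding masters ...
START SLAVE FOR CHANNEL 'm1';
SHOW SLAVE STATUS FOR CHANNEL 'm1'\G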



On all masters it was now possible to turn off log_slave_updates: they no longer need to relay anything anywhere, since each one sends only its own changes in its own stream (which lowers the load on the masters even further).



And along the way you can also enable the minimal binlog row image and parallel processing of transactions (the "parallel" part is quite conditional and has to be understood correctly):



slave_parallel_workers=5
slave_parallel_type=LOGICAL_CLOCK
binlog_row_image=minimal
      
      





With this setup we could switch the load to the new cluster at any moment, but the route is one-way and provides no rollback.



For as long as connections to the old cluster were still alive, log_slave_updates stayed enabled on the one master that served as the entry point from the old cluster into the new one, so every change arriving through the "old" cluster was reliably delivered to the new one. As soon as those connections died off, the option was disabled; by that point the application was already using the other 3 masters, and the data streams did not intersect in any way.
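
One simple way (not necessarily the one we used) to see when the old connections have finally died off is to watch the process list on the old masters, ignoring replication threads:

SELECT USER, HOST, DB, COMMAND, COUNT(*) AS conns
  FROM information_schema.PROCESSLIST
 WHERE COMMAND <> 'Binlog Dump'        -- skip dump threads feeding the slaves
   AND USER <> 'system user'           -- skip the replication SQL/IO threads
 GROUP BY USER, HOST, DB, COMMAND;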



As a result, we got the following set of advantages:



  • If a long query blocks something, it affects only one of the 4 replication threads and does not touch the others at all;
  • The new binlog format, which the old MySQL version previously made impossible, takes up several times less space (and therefore traffic), so far more changes can flow through the cluster;
  • Any master can now be switched off absolutely painlessly, without affecting anything else;
  • Master failures are no longer so scary: in any unclear situation you can clone any server in a minute, regenerate its server_id, create credentials for slave access, and the new master is ready (a rough sketch of this follows below).
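
The last point, very roughly and with purely illustrative ids and credentials (the data directory itself is assumed to be copied with whatever snapshot tooling you have):

-- on the freshly cloned copy, before pointing any slaves at it
SET GLOBAL server_id = 42;                                  -- must be unique in the cluster; persist it in my.cnf as well
CREATE USER 'repl'@'10.0.%' IDENTIFIED BY 'repl_password';  -- replication account for the slaves
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.%';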


There is also a "minus":



  • Each master now has many more slaves attached, so it is easier to saturate the network link (in fact this is not an increase in total traffic, but a redistribution of it in time and space).


What did the new scheme give



The move to the new scheme was a success; we carried it out in a single day, on August 28, 2020. Experience with the new architecture has shown that the number of replication problems dropped by a factor of three to four (getting rid of them completely is impossible). The stability of the system has increased, and the main result is a higher maximum throughput. And if developers could previously blame any mysterious problem on replication, that excuse no longer works.



The number of client-facing problems caused by replication delays has dropped several-fold, which means customers feel at least a little less pain. We can now take any master server offline at any time to do maintenance on it, without affecting the rest of the cluster or stopping replication.



The cluster serves the main Eldorado site: for the most part an old monolithic application with product cards, personal accounts, the shopping cart, order processing, the call center, and so on. At the time of writing, the total read load on the cluster (slaves only) is 40k rps, roughly 5k rps per database server, not counting the load on dedicated technical slaves, and it is noticeably higher at peak times. That may not sound like much, but you have to take into account the nature and complexity of these queries.



We really hope someone finds our experience useful. Besides multi-source replication we use plenty of other interesting things, such as BLACKHOLE and FEDERATED tables; they too remove a lot of headaches (and add a few for those who do not understand why they are needed). If you are interested in the nuances, or have any other questions about our MySQL setup, welcome to the comments.



In half a year of production operation we have not run into a single problem related to multi-source replication, and we can confidently recommend this setup as fault-tolerant and reliable.



The monolith itself is now being cut into a number of separate, independent services; some of them we will shard, and we will share that experience as well. Stay tuned.



Special thanks to my excellent team; we could not have done this without them!



P.S. By the way, we are still really looking for talented programmers. If that is you, come join us; it will be interesting.


