How we built a highly efficient and inexpensive DataLake, and why it turned out that way

We live in an amazing time when you can quickly and easily glue together several ready-made open-source tools, configure them on autopilot following Stack Overflow advice, without digging into the walls of text, and launch the result into production. And then, when it is time to update or scale, or when someone accidentally reboots a couple of machines, you realize that some obsessive bad dream has come true: everything has become dramatically complicated beyond recognition, there is no turning back, the future is foggy, and it feels safer to give up programming altogether and take up beekeeping and cheesemaking.



It is not for nothing that more experienced colleagues, heads strewn with bugs and gray-haired because of it, watch the incredibly fast deployment of packs of "containers" in "cubes" across dozens of servers in "fashionable languages" with built-in support for asynchronous non-blocking I/O, and just smile modestly... They silently keep re-reading "man ps", digging into the nginx source code until their eyes bleed, and writing, writing, writing unit tests. These colleagues know that the really interesting part comes later, when "all of this" grinds to a halt one night on New Year's Eve. And only a deep understanding of the nature of unix, a memorized TCP/IP state table, and the basic sort and search algorithms will help them bring the system back to life while the midnight chimes are ringing.



Oh yes, I got a little carried away, but I hope I have managed to convey the sense of anticipation.

Today I want to share our experience of deploying a convenient and inexpensive stack for a DataLake, one that covers most of the analytical tasks of completely different divisions of the company.



Some time ago we came to understand that the company increasingly needs the fruits of both product and technical analytics (not to mention the cherry on top in the form of machine learning), and that to understand trends and risks we need to collect and analyze more and more metrics.



Basic technical analytics at Bitrix24



Several years ago, simultaneously with the launch of the Bitrix24 service, we actively invested time and resources in building a simple and reliable analytical platform that would help us quickly see infrastructure problems and plan the next step. Of course, we wanted to take ready-made tools that were as simple and understandable as possible. As a result, nagios was chosen for monitoring and munin for analytics and visualization. We now have thousands of checks in nagios and hundreds of charts in munin, and colleagues use them successfully every day. The metrics are clear, the graphs are clear, the system has been working reliably for several years, and new checks and graphs are regularly added to it: when we put a new service into operation, we add a few checks and graphs. So far, so good.



Finger on the pulse - advanced technical analytics



The desire to receive information about problems "as quickly as possible" led us to active experiments with simple and understandable tools - pinba and xhprof.



Pinba sent us statistics in UDP packets about how fast different parts of web pages rendered in PHP, and in the MySQL storage (pinba ships with its own MySQL engine for quick event analytics) we could see a short list of problems online and react to them. And xhprof automatically collected call graphs of the slowest PHP pages served to clients, so we could analyze what had led to this calmly, over a cup of tea or something stronger.



Some time ago the toolkit was supplemented with yet another fairly simple and straightforward engine built on an inverted-index algorithm, beautifully implemented in the legendary Lucene library: Elastic/Kibana. The simple idea of multi-threaded writing of documents into a Lucene inverted index based on log events, and quickly searching through them with facet filtering, turned out to be genuinely useful.



Despite the rather technical look of the Kibana visualizations, with low-level concepts like "buckets" that "flow upwards" and the reinvented language of the not-yet-forgotten relational algebra, the tool began to help us a lot with tasks like these:



  • How many PHP errors did a Bitrix24 client have on the p1 portal in the last hour, and which ones? Understand, forgive and fix quickly.
  • How many video calls were made on Bitrix24 portals in the last 24 hours, and were there any channel/network issues?
  • How well does the new functionality (our C extension for PHP) behave in the latest update, and are there segfaults?
  • Do clients have enough PHP memory? Are the per-process limits being exceeded («out of memory»)? And so on.


Here is a concrete example. Despite careful, multi-level testing, a client with a very non-standard case and corrupted input data hit an annoying and unexpected error; the siren went off and the process of quickly fixing it began:







In addition, Kibana lets you set up notifications for specified events, and within a short time the tool was being used by dozens of employees from different departments across the company, from technical support and development to QA.



It has become convenient to track and measure the activity of any division within the company: instead of manually analyzing logs on the servers, it is enough to set up log parsing and shipping to the Elastic cluster once, and then enjoy, for example, contemplating in a Kibana dashboard the number of two-headed kittens printed on a 3D printer that were sold over the last lunar month.
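As a rough illustration of that setup, here is a minimal sketch, assuming the official Python elasticsearch client (8.x), a hypothetical index name and a simplified log format; the real collectors are, of course, more elaborate:

```python
# Minimal sketch: parse one nginx-style access log line and index it into Elasticsearch.
# Assumptions: elasticsearch-py 8.x client, a reachable cluster, index name "access-logs".
import re
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def ship(line: str) -> None:
    m = LOG_RE.match(line)
    if not m:
        return  # skip lines the pattern cannot parse
    doc = {
        "ip": m.group("ip"),
        "request": m.group("request"),
        "status": int(m.group("status")),
        "bytes": int(m.group("bytes")),
        "@timestamp": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
    }
    es.index(index="access-logs", document=doc)

ship('203.0.113.7 - - [01/Dec/2019:10:15:32 +0300] "GET /index.php HTTP/1.1" 200 5213')
```

Once the documents are in the index, the Kibana dashboards and alerts described above are built on top of them.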



Basic Business Intelligence



Everyone knows that business intelligence in companies often starts with extremely active use of, yes, yes, Excel. The main thing is that it does not end there. Cloud-based Google Analytics adds fuel to the fire: you quickly get used to good things.



In our harmoniously developing company, "prophets" of more intensive work with larger data began to appear here and there. The need for deeper and more multifaceted reports came up regularly, and thanks to the efforts of guys from different departments, a simple and practical solution was put together some time ago: a combination of ClickHouse and PowerBI.



For quite a long time this flexible solution helped a lot, but gradually the understanding set in that ClickHouse is not made of rubber and cannot be abused like that forever.



Here it is important to understand that ClickHouse, like Druid, like Vertica, like Amazon RedShift (which is based on Postgres), is an analytical engine optimized for reasonably convenient analytics (sums, aggregations, per-column minimums and maximums, and a little bit of joins), because it is organized to store the columns of relational tables efficiently, unlike MySQL and the other row-oriented databases we know.
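To make the distinction concrete, here is a minimal sketch of the kind of query a columnar engine digests happily, assuming the clickhouse-driver Python package and a hypothetical hits table; only the columns referenced in the query get read from disk, not whole rows:

```python
# Sketch: a typical column-store-friendly aggregation.
# Assumptions: clickhouse-driver package, hypothetical "hits" table with
# (event_date, portal, duration_ms) columns.
from clickhouse_driver import Client

client = Client(host="localhost")

rows = client.execute(
    """
    SELECT portal,
           count() AS requests,
           avg(duration_ms) AS avg_ms,
           max(duration_ms) AS max_ms
    FROM hits
    WHERE event_date >= today() - 7
    GROUP BY portal
    ORDER BY requests DESC
    LIMIT 20
    """
)
for portal, requests, avg_ms, max_ms in rows:
    print(portal, requests, round(avg_ms, 1), max_ms)
```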



In essence, ClickHouse is just a more capacious "database", with not very convenient point inserts (by design, and that is fine), but with pleasant analytics and a set of interesting, powerful functions for working with data. Yes, you can even build a cluster out of it - but you understand that hammering nails with a microscope is not entirely right, and we began to look for other solutions.



The demand for Python and for analysts



There are many developers in our company who have been writing code almost every day for 10-20 years in PHP, JavaScript, C#, C/C++, Java, Go, Rust, Python, Bash. There are also many experienced system administrators who have lived through more than one absolutely incredible disaster that does not fit the laws of statistics (for example, when most of the disks in a raid-10 are destroyed by a powerful lightning strike). In such surroundings, for a long time it was not clear what a "Python analyst" even is. Python is like PHP, only the name is slightly longer and there are slightly fewer traces of mind-altering substances in the interpreter's source code. However, as more and more analytical reports were created, experienced developers began to appreciate the importance of narrow specialization in tools like numpy, pandas, matplotlib, seaborn.

The decisive role was most likely played by employees suddenly fainting at the words "logistic regression", and by a demonstration of effective reporting on large data using, yes, yes, pyspark.



Apache Spark, its functional paradigm with relational algebra bolted on, and its capabilities made such an impression on developers accustomed to MySQL that the need to reinforce the battle ranks with experienced analysts became clear as day.
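To give a flavor of that style, here is a minimal pyspark sketch; the dataset path and column names are hypothetical, not our actual reports:

```python
# Sketch of the pyspark style that won developers over: declarative, relational,
# and lazily executed. The events.json path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("report-sketch").getOrCreate()

events = spark.read.json("s3://example-bucket/events.json")

report = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy("country")
    .agg(F.count("*").alias("purchases"),
         F.sum("amount").alias("revenue"))
    .orderBy(F.desc("revenue"))
)

report.show(20)
```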



Further attempts to get Apache Spark / Hadoop off the ground, and what went wrong



However, it soon became clear that something is systemically not quite right with Spark, or perhaps we simply needed to wash our hands better. If the Hadoop/MapReduce/Lucene stack was made by fairly experienced programmers - which is obvious if you look at the Java source code, or at Doug Cutting's ideas in Lucene - then Spark, all of a sudden, is written in Scala, an exotic language that is very controversial from a practicality standpoint and is not really developing right now. And the regular failures of computations on the Spark cluster, due to illogical and not very transparent memory allocation for reduce operations (many keys arriving at once), created around it the halo of something that still has room to grow. The situation was further aggravated by a large number of strange open ports, temporary files growing in the most incomprehensible places, and the hell of jar dependencies - all of which gave the system administrators one well-known feeling from childhood: fierce hatred (or maybe we really did need to wash our hands with soap).



As a result, we "survived" several internal analytical projects actively using Apache Spark (including Spark Streaming and Spark SQL) and the Hadoop ecosystem (and so on, and so forth). Even though over time we learned to "cook" and monitor it well, and it practically stopped crashing suddenly when the nature of the data changed and the uniform RDD hashing lost its balance, the desire to take something ready-made, updated and administered somewhere in the cloud grew stronger and stronger. It was at this time that we tried the ready-made cloud build from Amazon Web Services - EMR - and, subsequently, tried to solve our problems with it. EMR is Apache Spark prepared by Amazon with additional software from the ecosystem, similar to the Cloudera/Hortonworks builds.



"Rubber" file storage for analytics - an urgent need



The experience of "cooking" Hadoop/Spark, with burns to various parts of the body, was not in vain. The need became clearer and clearer for a single, inexpensive and reliable file storage that would be resistant to hardware failures, that could hold files in different formats coming from different systems, and from which we could make efficient and timely selections of data for reports.



I also wanted updates of this platform not to turn into a New Year's nightmare of reading 20-page Java stack traces and analyzing kilometer-long detailed cluster logs with the Spark History Server and a backlit magnifying glass. I wanted a simple and transparent tool that does not require regular dives under the hood whenever a developer's standard MapReduce job stops running because a reduce worker fell over with out-of-memory thanks to a poorly chosen partitioning of the input data.



Amazon S3 as a DataLake candidate?



The experience with Hadoop/MapReduce taught us that we need a scalable, reliable file system with scalable workers on top of it that "come" closer to the data, so that we do not drag the data across the network. The workers should be able to read data in different formats, but preferably should not read unnecessary information, and it should be possible to store the data in advance in formats convenient for the workers.



Once again, the main idea. There is no desire to "pour" big data into a single cluster analytical engine, which sooner or later will choke and will have to be sharded in ugly ways. We want to store files, just files, in an understandable format, and run effective analytical queries on them with different but understandable tools. And there will be more and more files, in more and more formats. And it is better to shard not the engine, but the source data. We need an extensible and versatile DataLake, we decided...



What if we store the files in the familiar and well-known scalable Amazon S3 cloud storage, without having to roll our own contraption out of Hadoop?



Putting the data in is the easy part. But how do you get it back out and "crunch it" efficiently?



The cluster big-data analytics ecosystem of Amazon Web Services, in very simple words



Judging by our experience with AWS, Apache Hadoop/MapReduce has long been actively used there under various sauces, for example in the DataPipeline service (I envy my colleagues - they learned how to cook it properly). Here is how we set up backups of DynamoDB tables from different services:





And for several years they have been running regularly, like clockwork, on the built-in Hadoop/MapReduce clusters. Set it up and forget it:







You can also effectively engage in data satanism by spinning up Jupyter notebooks for analysts in the cloud and using AWS SageMaker to train and deploy AI models into battle. Here is how it looks for us:







And yes, you can spin up a notebook in the cloud for yourself or for an analyst, attach it to a Hadoop/Spark cluster, run the calculations, and then "kill" the whole thing:







This is really convenient for individual analytical projects, and for some of them we have successfully used the EMR service for large-scale calculations and analytics. But what about a systemic solution for the DataLake - would it work there? At that moment we were on the verge between hope and despair and continued the search.



AWS Glue - neatly packaged Apache Spark "on steroids"



It turned out that AWS has its own version of the Hive/Pig/Spark stack. The role of Hive, i.e. the catalog of files and their types in the DataLake, is played by the "Data catalog" service, which does not hide its compatibility with the Apache Hive format. You add to this service information about where your files are located and what format they are in. The data can live not only in s3 but also in databases, but that is not the topic of this post. Here is how the DataLake data catalog is organized for us:







The files are registered - great. If the files have been updated, we launch crawlers, either by hand or on a schedule, which refresh the information about the files in the lake and save it. The data from the lake can then be processed and the results unloaded somewhere; in the simplest case we upload them to s3 as well. The processing can be done anywhere, but the suggested way is to run it on an Apache Spark cluster using the advanced capabilities of the AWS Glue API. In essence, you can take your good old familiar Python code using the pyspark library and configure it to run on N nodes of a cluster of some capacity, with monitoring, without digging into the guts of Hadoop, dragging around docker-shmocker containers and resolving dependency conflicts.
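For reference, a Glue job of this kind is essentially ordinary pyspark with a thin wrapper around it; here is a rough sketch following the standard Glue job boilerplate, where the catalog database, table name and output path are hypothetical:

```python
# Rough sketch of an AWS Glue job: read a table registered in the Data Catalog,
# filter/transform with plain pyspark, write the result back to s3 as parquet.
# The database, table_name and output path below are hypothetical placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# DynamicFrame from the catalog, then drop down to a regular Spark DataFrame
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="datalake", table_name="raw_events"
)
df = dyf.toDF()

daily = (
    df.filter(F.col("status") == 200)
      .groupBy("year", "month", "day")
      .count()
)

daily.write.mode("overwrite").parquet("s3://example-bucket/reports/daily_counts/")

job.commit()
```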



Once again, a simple idea. There is no need to configure Apache Spark itself: you just write Python code for pyspark, test it locally on your desktop, and then run it on a large cluster in the cloud, specifying where the source data is and where to put the result. Sometimes this is exactly what is needed, and here is how it is configured:







So, if you need to compute something on a Spark cluster over the data in s3, we write the code in python/pyspark, test it, and wish it a good trip to the cloud.



What about orchestration? What if a task fails and disappears? Yes, AWS proposes building nice pipelines in the style of Apache Pig, and we even tried them, but for now we decided to use our own deeply customized orchestration in PHP and JavaScript (I understand, there is some cognitive dissonance here, but it has been working for years without errors).







The file format in the lake is the key to performance



It is very, very important to understand two more key points. For queries against the files in the lake to execute as quickly as possible, and for performance not to degrade as new information is added, you need to:



  • Store the columns of the files separately (so that you do not have to read all the rows just to understand what is in the columns). For this we chose the parquet format with compression.
  • It is very important to shard the files into folders by, say: language, year, month, day, week. Engines that understand this kind of sharding will look only into the right folders, without dragging all the data through themselves.


In effect, this way you lay out the source data in the most efficient form for the analytical engines hung on top, which can then selectively descend into the sharded folders and read only the needed columns from the files. There is no need to "re-pour" the data anywhere later (the storage would simply burst): just put it into the file system sensibly, in the correct format, right away. Of course, it should be clear by now that storing a huge csv file in the DataLake, which the cluster must first read line by line just to extract the columns, is not a great idea. Think about the two points above again if it is not yet clear why all this matters.
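Here is a minimal pyspark sketch of what that looks like in practice (bucket and column names are hypothetical); the partitionBy call produces exactly the folder sharding described above:

```python
# Sketch: write data to s3 as compressed parquet, sharded into folders by date,
# so that engines on top can prune both columns and partitions.
# The bucket and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-writer").getOrCreate()

df = spark.read.json("s3://example-bucket/incoming/2019-12-01/*.json.gz")

(
    df.write
      .mode("append")
      .partitionBy("lang", "year", "month", "day")
      .option("compression", "snappy")
      .parquet("s3://example-bucket/lake/events/")
)

# Resulting layout, which Athena/Spark can prune folder by folder:
#   s3://example-bucket/lake/events/lang=de/year=2019/month=12/day=01/part-....snappy.parquet
```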



AWS Athena - a jack-in-the-box



And then, while building the lake, we somehow stumbled upon Amazon Athena almost in passing. Suddenly it turned out that by neatly laying out our huge log files into folder shards in the correct columnar (parquet) format, you can very quickly make extremely informative selections over them and build reports WITHOUT a cluster - no Apache Spark/Glue needed.



The Athena engine over s3 data is based on the legendary Presto, a member of the MPP (massively parallel processing) family of data processing approaches, which takes data where it lies, from s3 and Hadoop to Cassandra and plain text files. You just ask Athena to execute a SQL query, and then everything "works quickly and by itself". It is important to note that Athena is "smart": it goes only into the necessary sharded folders and reads only the columns needed by the query.



The billing of Athena queries is also interesting. We pay for the amount of data scanned. That is, not for the number of machines in the cluster per minute, but... for the data actually scanned on the 100-500 machines - only the data needed to satisfy the query.



And by requesting only the necessary columns from properly sharded folders, it turned out that the Athena service costs us tens of dollars a month. Well, great - almost free, compared to analytics on clusters!



By the way, this is how we shard our data in s3:







As a result, within a short time completely different departments in the company, from information security to analytics, began actively querying Athena and quickly, within seconds, getting useful answers from the "big" data over rather long periods: months, half a year, and so on.



But we went further and began going to the cloud for answers via an ODBC driver: an analyst writes a SQL query in a familiar console, and it, on 100-500 machines, "for pennies", combs through the data in s3 and usually returns an answer within a few seconds. Convenient. And fast. I still cannot quite believe it.



As a result, having decided to store the data in s3, in an efficient columnar format and with reasonable sharding of the data into folders... we got a DataLake and a fast, cheap analytical engine practically for free. And it became very popular in the company, because it understands SQL and works orders of magnitude faster than starting, stopping and configuring clusters. "And if the result is the same, why pay more?"



A query to Athena looks something like this. If desired, you can of course compose a sufficiently complex, multi-page SQL query, but we will limit ourselves to a simple grouping. Let's see what response codes a client had over the last few weeks in the web server logs and make sure there are no errors:
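Such a query can just as well be fired off from Python via boto3; a minimal sketch, where the database, table and bucket names are hypothetical:

```python
# Sketch: run a simple grouping query against Athena from Python via boto3
# and poll for the result. Database, table and bucket names are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="eu-central-1")

query = """
    SELECT status, count(*) AS hits
    FROM weblogs.access_log
    WHERE year = '2019' AND month = '11'
    GROUP BY status
    ORDER BY hits DESC
"""

qid = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)["QueryExecutionId"]

# Wait for the query to finish, then print the grouped response codes
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"][1:]:
    status, hits = (col.get("VarCharValue") for col in row["Data"])
    print(status, hits)
```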







Conclusions



Having walked a path that was not so much long as painful, constantly soberly assessing the risks, the level of complexity and the cost of support, we found a solution for the DataLake and analytics that never ceases to please us with both its speed and its cost of ownership.



It turned out that even seasoned developers who have never worked as architects, cannot draw squares upon squares with arrows, and do not know 50 terms from the Hadoop ecosystem are able to build an effective, fast and cheap DataLake for the needs of completely different divisions of the company.



At the beginning of the journey my head was splitting from the sheer variety of wild zoos of open and closed software and from the awareness of the burden of responsibility to our descendants. Just start building your DataLake from simple tools: nagios/munin -> elastic/kibana -> Hadoop/Spark/s3..., collecting feedback and deeply understanding the physics of the processes involved. Everything complicated and murky - leave it to your enemies and competitors.



If you do not want to go to the cloud and you like maintaining, updating and patching open-source projects, you can build a scheme similar to ours locally, on inexpensive office machines, with Hadoop and Presto on top. The main thing is not to stop: keep moving forward, keep counting, look for simple and clear solutions, and everything will definitely work out! Good luck to everyone and see you soon!


