How a company should choose tools for data engineers without turning everything into a technology zoo: the experience of Profi.ru

The editor of Netology talked with Pavel Sayapin, team lead of the BI team at Profi.ru, about the tasks data engineers solve in his team, the tools they use for them, and how to approach choosing tools for data problems, including atypical ones. Pavel also teaches on the Data Engineer course.



Data engineers' tasks - the example of Profi.ru



Profi.ru is a service that helps clients and specialists in various fields find each other. The service database contains more than 900 thousand specialists across 700 types of services: tutors, repairmen, coaches, beauty professionals, artists and others. More than 10 thousand new orders are registered every day, which generates about 100 million events per day. Keeping this amount of data in order is impossible without professional data engineers.



Ideally, a data engineer develops a data culture that helps the company generate additional revenue or reduce costs. They bring value to the business by working in a team and acting as an important link between various participants, from developers to business consumers of reporting. But the tasks can differ from company to company, so we will consider them using the example of Profi.ru.



Make data understandable, accessible, and easy to use



The data should be clear for decision making and easy to use. You shouldn't need to hunt for a description or write a complex SQL query that takes many different factors into account. In the ideal picture, the user looks at a dashboard and is satisfied with everything; if some aspect of the data is missing, they go to the database and get what they need with a simple SQL query.





Place of the Data Quality process in the general structure of the data warehouse



Explanatory documentation on working with data is of great importance. It simplifies the work of both the data engineer (who is not distracted by questions) and the data user (who can find answers on their own). In Profi.ru such documents are collected on the internal forum.



Convenience includes the speed of data retrieval. Speed = availability in one step: a click, a dashboard. But in practice, everything is more complicated.



From the point of view of the end user of a dashboard, even Tableau cannot display every possible dimension: the user is limited to the filters the dashboard developer provided. This creates two scenarios:



  • The developer makes many slices for the dashboard, so the number of pages grows a lot. This reduces the availability of data: it becomes hard to understand where anything is.
  • The developer creates only the key slices. Information is easier to find, but for a slightly non-standard slice you still have to go either to the database or to the analysts, which is also bad for accessibility.


Accessibility is a broad concept: it means having data in the proper form, being able to get information from dashboards, and being able to get the particular data slice you need.


Accumulate data from all sources in one place



Data sources can be internal or external. For example, a business may depend on weather reports, which have to be collected and stored from external sources.



Information must be stored with an indication of its source and in a way that makes the data easy to find. Profi.ru solves this with automated documentation: YML files serve as documentation for internal data sources.



Make dashboards



Data visualization is best done in a professional tool like Tableau. 



Most people make decisions emotionally, so visibility and aesthetics matter. Excel, by the way, is not well suited for visualization: it does not cover all the needs of data users. For example, a product manager likes to dig into the numbers, but only if it is convenient to do so. That lets him solve his problems rather than think about how to get the information and collect the metrics.



High-quality data visualization makes decisions easier and faster.
 

The higher a person's position, the more they need aggregated data at hand, on the phone. Top managers do not need details; what matters is controlling the situation as a whole, and BI is a good tool for that.





An example of a Profi.ru product dashboard (one of the sheets). For confidentiality, the names of metrics and axes are hidden



Examples of real tasks 



Task 1 - transfer data from source (operational) systems to the data warehouse, that is, ETL



One of the routine tasks of a data engineer. 



For this, you can use:



  • self-written scripts run by cron or by a dedicated orchestrator such as Airflow or Prefect (a minimal sketch follows this list);
  • open-source ETL solutions: Pentaho Data Integration, Talend Data Studio and others;
  • proprietary solutions: Informatica PowerCenter, SSIS and others;
  • cloud solutions: Matillion, Panoply and others.
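The first option - a self-written script run by cron - can be very small. Below is a minimal sketch of such a script, assuming a MySQL source and the PyMySQL library; the connection details, table and column names are invented for illustration and are not Profi.ru's actual ones.

```python
# Hypothetical cron-style extract: pull the daily increment from a MySQL
# source into a CSV staging file that a separate step loads into the warehouse.
import csv
import pymysql  # assumes the PyMySQL client library is installed

SOURCE = dict(host="mysql.internal", user="etl", password="***", database="orders")

def extract_orders(since: str, out_path: str) -> None:
    """Dump orders created after `since` into a CSV staging file."""
    conn = pymysql.connect(**SOURCE)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, client_id, service_type, created_at "
                "FROM orders WHERE created_at >= %s",
                (since,),
            )
            rows = cur.fetchall()
    finally:
        conn.close()

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "client_id", "service_type", "created_at"])
        writer.writerows(rows)

if __name__ == "__main__":
    # cron calls this once a day; the warehouse side then picks up the CSV
    extract_orders("2020-01-01", "/tmp/orders_increment.csv")
```

An orchestrator such as Airflow adds scheduling, retries and monitoring on top of exactly this kind of script (see the DAG sketch in the section on Python below).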


In a simple version, the task is solved by writing a YML file of 20 lines. It takes about 5 minutes. 



In the most difficult case, when you need to add a new source - for example, a new database - it can take up to several days. 



In Profi, this simple task - with a streamlined process - consists of the following steps:



  • Find out from the customer what data is needed and where it lives.
  • Check whether there is access to this data.
  • If there is no access, request it from the admins.
  • Create a new Git branch named after the Jira issue.
  • Create a migration that adds the data to the anchor model via an interactive Python script.
  • Add loader files (a YML file describing where the data is taken from and which table it is written to) - a hypothetical example follows this list.
  • Test on the staging environment.
  • Load the data into the warehouse.
  • Create a pull request.
  • Go through code review.
  • After the code review passes, the changes are merged into the master branch and automatically rolled out to production (CI/CD).
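To make the loader-file step more concrete, here is a hypothetical example of what such a description might look like. The real schema of Profi.ru's YML files is internal, so every field name below is an assumption; the snippet only illustrates the principle of "where the data is taken from and which table it is written to".

```python
# Entirely hypothetical loader description: field names and values are made up.
import yaml  # assumes PyYAML is installed

LOADER_SPEC = """
source:
  connection: mysql_orders          # name of a connection known to the loader
  # daily increment; '{{ ds }}' stands for an execution-date macro
  query: >
    SELECT id, client_id, service_type, created_at
    FROM orders
    WHERE created_at >= '{{ ds }}'
target:
  schema: stg
  table: orders_increment
  mode: append
"""

spec = yaml.safe_load(LOADER_SPEC)
print(spec["target"]["table"])  # -> orders_increment
```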


Task 2 - lay out the loaded data conveniently



Another common task is to lay out the loaded data so that the end user (or a BI tool) can work with it comfortably and does not have to make extra moves for most tasks - that is, to build or update the Dimensional Data Store (DDS).



For this, the solutions from the first task can be applied, since this is also an ETL process. In its simplest form, the DDS is updated with SQL scripts.
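As a rough illustration of "the DDS is updated with SQL scripts", the sketch below runs a dimension upsert as a single MERGE statement. The table names are invented, and the vertica_python driver is only an assumption, chosen because Vertica is mentioned below as the storage for data marts.

```python
# Hypothetical DDS refresh: upsert a dimension table from a staging table.
import vertica_python  # assumes the vertica-python client is installed

CONN_INFO = {
    "host": "vertica.internal",
    "port": 5433,
    "user": "etl",
    "password": "***",
    "database": "dwh",
}

UPSERT_DIM_SPECIALIST = """
MERGE INTO dds.dim_specialist tgt
USING stg.specialists src
  ON tgt.specialist_id = src.specialist_id
WHEN MATCHED THEN UPDATE SET
  name = src.name,
  service_type = src.service_type
WHEN NOT MATCHED THEN INSERT (specialist_id, name, service_type)
  VALUES (src.specialist_id, src.name, src.service_type)
"""

def refresh_dds() -> None:
    """Apply the dimension upsert after a new staging load has landed."""
    with vertica_python.connect(**CONN_INFO) as conn:
        cur = conn.cursor()
        cur.execute(UPSERT_DIM_SPECIALIST)
        conn.commit()
```

In practice such a statement would be triggered by the orchestrator right after the staging load finishes.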



Task 3 - an atypical one



Streaming analytics is just emerging at Profi. Product teams generate a large number of events, which we record in ClickHouse. But ClickHouse cannot handle inserting records one at a time in large volumes, so records have to be combined into batches. In other words, you cannot write directly; you need an intermediate handler.
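The production handler at Profi.ru is built on Apache Flink (more on that below); the Python sketch here only illustrates the batching idea itself, using the clickhouse-driver package and an invented events table.

```python
# Illustrative only: accumulate events and write them to ClickHouse in batches
# instead of inserting rows one by one.
from clickhouse_driver import Client

client = Client(host="clickhouse.internal")
BATCH_SIZE = 10_000
buffer = []

def handle_event(event: dict) -> None:
    """Buffer an incoming event and flush when the batch is full."""
    buffer.append((event["ts"], event["user_id"], event["name"]))
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush() -> None:
    """Insert the accumulated batch into ClickHouse and clear the buffer."""
    if buffer:
        client.execute(
            "INSERT INTO events (ts, user_id, name) VALUES",
            buffer,
        )
        buffer.clear()
```

A real handler would also flush on a timer, so a half-filled batch does not sit in memory indefinitely.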



We use an engine based on Apache Flink. So far the procedure is as follows: the engine processes the incoming stream of events → inserts them into ClickHouse in batches → counts the number of events per 15 minutes on the fly → sends the counts to a service that checks for anomalies by comparing them with the values for the same 15-minute window over the last 3 months → if there is an anomaly, it sends a notification to Slack.
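The article does not describe the exact anomaly rule, so the sketch below assumes a simple "more than three standard deviations away from the same 15-minute slot over the previous days" check and a Slack incoming webhook; the URL, thresholds and function names are placeholders.

```python
# Hypothetical anomaly check for 15-minute event counts, with a Slack alert.
import statistics
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def is_anomaly(current_count: int, history: list[int]) -> bool:
    """history: counts for the same 15-minute slot on previous days (~90 values)."""
    if len(history) < 10:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0
    return abs(current_count - mean) > 3 * stdev

def notify(slot: str, current_count: int) -> None:
    """Post a short message to a Slack incoming webhook."""
    requests.post(SLACK_WEBHOOK_URL, json={
        "text": f"Event count anomaly in slot {slot}: {current_count} events",
    })
```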





Front-end analytics schema (the loading part)



The Apache Flink framework guarantees at-least-once delivery, so there is a chance of duplicates. In the case of RabbitMQ, this can be solved with a Correlation ID: with effectively single delivery, data integrity is guaranteed.
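A minimal sketch of the Correlation ID idea with RabbitMQ and the pika client: duplicate deliveries are dropped before they reach the warehouse. The in-memory set stands in for a persistent deduplication store, and the queue and host names are made up.

```python
# Hypothetical idempotent consumer: skip messages whose correlation_id
# has already been processed.
import pika

seen: set[str] = set()

def on_message(channel, method, properties, body) -> None:
    msg_id = properties.correlation_id
    if msg_id in seen:                      # duplicate delivery: drop it
        channel.basic_ack(method.delivery_tag)
        return
    seen.add(msg_id)
    process(body)                           # hand the event downstream
    channel.basic_ack(method.delivery_tag)  # ack only after processing

def process(body: bytes) -> None:
    ...  # e.g. append to the ClickHouse batch buffer

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.internal"))
channel = connection.channel()
channel.basic_consume(queue="frontend_events", on_message_callback=on_message)
channel.start_consuming()
```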



We count the number of events again using Apache Flink and display it through a self-written dashboard built on NodeJS with a ReactJS front end. A quick search turned up no similar ready-made solutions, and the code itself turned out to be simple - it didn't take long to write.



The monitoring is rather technical: we look at anomalies in order to catch problems at an early stage. The company's major global metrics are not yet included in the monitoring, since streaming analytics is still taking shape.



Essential Data Engineer Tools



The tasks of data engineers are now more or less clear, so here is a little about the tools used to solve them. Of course, the tools at different companies can (and should) differ: it all depends on the volume of data, how fast it arrives, and how heterogeneous it is. It can also depend on a specialist's bias toward a particular tool simply because they have worked with it and know it well. Profi.ru settled on the following options.



For data visualization - Tableau, Metabase



Tableau was chosen long ago. This system lets you quickly analyze large amounts of data and does not require a costly implementation. For us it is convenient, beautiful, and familiar - we work in it often.



Few people know about Metabase, yet it is very good for prototyping.



Among visualization tools, Airbnb's Superset is also worth mentioning. Its strong points are the many database connectors and visualization options. However, for an ordinary user it is less convenient than Metabase: you cannot join tables in it, you have to create separate views for that.



In Metabase, you can join tables, and the service even does it itself, taking the database schema into account. Metabase's interface is also simpler and nicer.



There are a lot of tools - just find yours.



For data storage - ClickHouse, Vertica



ClickHouse is a free, fast tool for storing product events. Analysts either build separate analytics on it directly (if the data there is enough for them), or data engineers take aggregates from it and re-upload them to Vertica to build data marts.



Vertica is a great, user-friendly product for serving the final data marts.



For data flow control and computation - Airflow



We load data through console tools - for example, through the client that ships with MySQL - because it turns out to be faster.



The advantage of console tools is speed: data is not pumped through the memory of a Python process. The downside is less control over the data as it transits from one database to another.
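One way such a console load might look: the MySQL client streams rows straight into clickhouse-client through a shell pipe, so the data never passes through the Python process. Hosts, credentials and table names are placeholders, and authentication options are omitted.

```python
# Hypothetical console pipeline launched from Python; the rows flow directly
# between the two client processes, not through this script's memory.
import subprocess

PIPELINE = (
    'mysql --batch --skip-column-names -h mysql.internal -u etl '
    '-e "SELECT id, client_id, created_at FROM orders" orders '
    '| clickhouse-client --host clickhouse.internal '
    '--query "INSERT INTO stg.orders FORMAT TabSeparated"'
)

subprocess.run(PIPELINE, shell=True, check=True)
```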



The main programming language is Python



Python has a much lower entry threshold, plus the company already has competence in this language. Another reason is that DAGs for Airflow are written in Python. These scripts are just a thin wrapper around the loads; most of the work is done via console commands.
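A minimal Airflow DAG in this spirit might look like the sketch below: the Python file only declares the schedule and wraps a console pipeline in a BashOperator (Airflow 2.x-style imports; the DAG name, hosts and the command itself are placeholders, not Profi.ru's actual ones).

```python
# Hypothetical "thin wrapper" DAG: Python declares the schedule, the actual
# data movement happens in the shell command.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="load_orders_to_stg",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_orders = BashOperator(
        task_id="mysql_to_clickhouse",
        bash_command=(
            'mysql --batch --skip-column-names -h mysql.internal -u etl '
            '-e "SELECT id, client_id, created_at FROM orders" orders '
            '| clickhouse-client --host clickhouse.internal '
            '--query "INSERT INTO stg.orders FORMAT TabSeparated"'
        ),
    )
```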



We use Java for real-time analytics development.



An approach to choosing data tools - how not to breed a technology zoo



There are many tools on the market for working with data at every stage, from its appearance to its output on a dashboard for the board of directors. Not surprisingly, some companies end up with a number of unrelated solutions: the so-called technology zoo.



A technology zoo is a set of tools that do the same thing - for example, Kafka and RabbitMQ for messaging, or Grafana and Zeppelin for visualization.





A map of technologies and companies in the data and AI field - you can see how many duplicate solutions there can be



Also, many people may use different ETL tools for their own purposes, and this is exactly the situation at Profi. The main ETL runs on Airflow, but some people use Pentaho for their own loads: they test hypotheses and don't need to push this data through the engineers. Such self-service tools are mostly used by fairly experienced specialists engaged in research - exploring new ways to develop the product. The data they analyze is mainly of interest to them alone and is constantly changing, so it makes no sense to add these loads to the main platform.



Returning to the zoo: often the use of duplicate technologies comes down to the human factor. Individual internal teams are used to working with a tool that another team may not use. And sometimes autonomy is the only way to solve specific problems. For example, an R&D team needs to test something with a certain tool - it's simply convenient, someone on the team has already used it, or there is some other reason. Waiting for system administrators to find the time to install and configure the tool takes a long time, and thoughtful, meticulous admins will still want proof that it is really necessary. So the team installs the tool on its own virtual machines and solves its specific tasks.





Another common reason for the emergence of new tools is the desire to try an unknown product in a fairly new area where standards have not yet formed and there are no proven guidelines. A data engineer, like a developer, should always research new tools in the hope of finding a better solution to current problems, or simply to keep abreast of what the market offers.



The temptation to try new tools is great indeed. But to make the right choice you need, above all, self-discipline: it helps you not give in completely to research impulses and to take into account the company's ability to support the infrastructure for a new tool.

Don't use technology for technology's sake. It is best to approach the question pragmatically: take a task and the set of tools that can solve it, then evaluate each of them and choose the best one. For example, one tool can solve the problem more efficiently, but there is no competence in it, while another is slightly less effective, but the company has people who know how to work with it. One tool is paid but easy to maintain and use; another is fashionable open source but needs a staff of admins to support it. Such dichotomies arise, and resolving them requires a cool head.

Choosing a tool is half a leap of faith, half personal experience: there is no complete certainty that the tool will fit.
For example, Profi started with Pentaho because there was expertise in this tool, but in the end it turned out to be the wrong decision. The internal Pentaho repository became very slow as the project grew: saving work took a minute, and if you have the habit of constantly saving your work, time simply slipped through your fingers. On top of that came a complicated launch procedure and scheduled tasks that hung the computer.



The suffering ended after switching to Airflow, a popular tool with a large community. 

A community around a service or tool matters for solving complex problems - you can ask colleagues for advice.
If the company is mature and has the resources, it also makes sense to consider buying technical support: it helps you troubleshoot problems quickly and get recommendations on using the product.



As for the approach to choosing, Profi adheres to the following principles:



  • Don't make a decision alone. When a person chooses something on his own, he automatically becomes convinced that he is right. Convincing others is another matter: you have to build a solid defense of your choice, which also helps you see the tool's weaknesses.
  • Seek advice from the Chief Data Officer (vertical dialogue). This can be the chief data engineer or the head of the BI team. Tops see the situation more broadly.
  • Communicate with other teams (horizontal dialogue). Ask what tools they use and how well they work. Perhaps a colleague's tool can solve your problem, and you won't have to breed a zoo of solutions.


Internal competencies as an effective replacement for an external service provider



Using the company's internal competencies can also be seen as part of the approach to choosing tools.



Quite often there are situations when a business has a difficult task but no money to implement it. The task is large and important, and ideally it would be best to bring in an external service provider with the relevant experience. But since there is no such opportunity (no money), an internal team is assigned to solve the problem. Besides, a business usually trusts its own employees more if they have already proven their effectiveness.



Examples of such tasks, where employees develop a new direction themselves, are load testing and building a data warehouse. The data warehouse especially, since it is a unique story for every business: a warehouse cannot be bought, you can only hire external specialists who will build it with the support of an internal team.



By the way, as the new direction develops, the team may realize that the need for an external service provider has disappeared.



In Profi, the implementation of BI was done in-house. The main difficulty was that the business wanted BI launched quickly, but building such a project takes time: growing competencies, filling in the data, designing a convenient storage schema, choosing tools and mastering them.



The main - hottest - phase, when everything was being built and taking shape, lasted about a year. And the project is still developing.

When building a corporate data warehouse, it is important to adhere to high standards, defend your position, and not cut corners just to please the business.



It was with great pain that we later redid most of what had to be done quickly back then. But sometimes a quick approach is justified; in product development it may even be the only correct one: you need to move fast, test product hypotheses, and so on. The warehouse, however, must rest on a solid architecture, otherwise it will not be able to adapt quickly to the growing business and the project will die out.



In this difficult project, our leader helped a lot: he defended the progress of the work, explained to management what we were doing, secured resources and simply protected us. Without such support, I'm not sure we would have managed to launch the project.



In such stories, an important role is played by so-called early adopters - those among top managers, analysts and product managers who are ready to try new things. For a raw topic to take off, it needs pioneers who will confirm that everything works and is convenient to use.



If anyone wants to share the solution to the third problem described above - welcome :-)


