Hello, Habr! My name is Ivan Kizimenko, I am Head of Analytics at Outside. In this article I would like to talk about how and why we developed our own custom analytics system instead of Google Analytics. In fact, we worked not only with analytics from Google, but also with systems from other vendors, including Adobe.
But one fine day (and it really was not a bad one) we decided: "Enough of putting up with this! It's time to build our own system." Under the cut: the reasons for this decision, the features of our custom system, and a few other things that should interest many Habr readers.
Why do we need all this?
We work with fairly large clients with global solutions that use very different analytics systems. Some have Adobe, some have the basic version of Google Analytics, others have Google Analytics 360.
But global solutions have a number of problems:
- These are closed systems with a number of restrictions on access to functionality.
- They have minimal customization options.
- If something can be improved at all, it is very expensive and time-consuming.
- The analytics covers only the vendor's global solution, without taking local developments into account.
Since Google Analytics is perhaps the most popular analytics system, let's list its disadvantages first:
- It takes a lot of effort to build custom reports and dashboards.
- In the system's reports, data is always aggregated, and this process is largely or entirely outside your control.
- Data sampling kicks in on large date ranges, custom segments and non-standard reports, which distorts the figures.
- The free version imposes limits on the amount of data that can be collected.
- Data export via the API is limited by quotas and is also subject to sampling.
- Data in the reports is updated with a delay of 24 to 48 hours, so you cannot react to changes quickly.
- The free version of Google Analytics allows only 20 custom dimensions and metrics, while the 360 version raises the limit to 200.
- There is no access to raw, hit-level data: you cannot, for example, get a user's IP address.
In general, we are not bashing Google Analytics, not at all. It is an excellent system, but it fits a limited, albeit fairly broad, range of tasks. As soon as a more specific task appears, such as calculating metrics for a year and comparing them with the previous period, the problems begin. In some places the tracking markup does not cover the request at all, in others it has broken, or the way an event is recorded has changed, so you have to fire up Python, slice the data by days and weeks, and produce figures that are at least somewhat close to reality.
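As a rough illustration of the kind of manual patching described above, here is a minimal Python sketch (the file name and column names are hypothetical) that takes an exported CSV of events and re-aggregates it by day and by week with pandas:

```python
import pandas as pd

# Hypothetical export: one row per event with a timestamp and an event name.
events = pd.read_csv("ga_export.csv", parse_dates=["timestamp"])

# Re-aggregate the raw export by day and by week to get figures
# that are at least close to reality after a markup change.
daily = events.set_index("timestamp").resample("D")["event_name"].count()
weekly = events.set_index("timestamp").resample("W")["event_name"].count()

print(daily.tail())
print(weekly.tail())
```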
Other systems have problems too, including Adobe Analytics. If you need to connect tools that Adobe does not provide, it takes a long time and costs a lot, so building a system on top of Adobe that covers all of the company's web analytics needs becomes economically unviable. Another significant drawback is that the system covers only global projects.
Problems sometimes arise where you least expect them. For example, we needed a report for the previous year and wanted to pull the data out of Google Analytics, but it did not work. It turned out that one of the managers at headquarters had gone into the counter settings and changed them, so all the historical data we needed was simply deleted.
As a result, from rather complex systems we get only the basics: traffic, leads and bounce rate (BR).
The day everything changed
Everything would be fine, but there is one big problem: bounce rate, the metric most specialists pay attention to. Naturally, everyone wants to lower it. If you really want to, doing so is very simple: a 25-second timer is "hung" on the page, the event fires, and the bounce rate drops. But only formally; in reality everything stays exactly as it was.
If you change the counting method, for example add two browser scroll events on top of the time threshold, BR grows, and dubious sources from CPA and programmatic networks start showing an 80% bounce rate. And once the methodology has changed, it becomes almost impossible to calculate the bounce rate for a year in which several different methods were used.
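To show how strongly the definition itself drives the number, here is a small Python sketch with made-up session data (the event names are purely illustrative) that computes bounce rate under two methodologies: "bounce = single-pageview session" versus "bounce = session with no engagement event such as a 25-second timer or a scroll":

```python
# Toy sessions: each session has a pageview count and a set of
# engagement events that fired on the page (timer, scroll, etc.).
sessions = [
    {"pageviews": 1, "events": set()},
    {"pageviews": 1, "events": {"timer_25s"}},
    {"pageviews": 1, "events": {"scroll"}},
    {"pageviews": 3, "events": {"scroll"}},
]

def bounce_rate_classic(sessions):
    """Bounce = session with exactly one pageview, regardless of events."""
    bounces = sum(1 for s in sessions if s["pageviews"] == 1)
    return bounces / len(sessions)

def bounce_rate_engagement(sessions, required={"timer_25s", "scroll"}):
    """Bounce = session with no engagement events at all."""
    bounces = sum(1 for s in sessions if not (s["events"] & required))
    return bounces / len(sessions)

print(f"classic:    {bounce_rate_classic(sessions):.0%}")     # 75%
print(f"engagement: {bounce_rate_engagement(sessions):.0%}")  # 25%
```

The same traffic yields 75% or 25% depending purely on the chosen definition, which is exactly why comparing periods measured with different methodologies is meaningless.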
There are other problems too: a large amount of manual work, a growing number of Python scripts and cron jobs, complex reports that are hard to build, and no end-to-end analytics. So we decided to revise our approach to both analytics and markup. To begin with, we defined the critical needs:
- Gaining access to raw data.
- The ability to transfer any amount of additional information.
- The ability to make backups.
- Using auto-markup.
- The ability to connect the database to BI tools.
The best option was to build our own data collection system, one that could be modified and customized quickly and easily. As the data store we chose ClickHouse. Naturally, an analytics system of our own had to meet a number of criteria, which we formulated at the very start:
- The ability to store any amount of data on your own servers. You can also restrict access to data if necessary.
- Collect any number of events and parameters, which provides much more data for analysis.
- The ability to integrate with services such as CRM, call tracking, BI, messengers, advertising systems, etc.
- Cross-platform user tracking by cookie and user ID.
- Create your own attribution models.
- Built on open source, with no security issues.
- Timely feedback to the site: events can be sent from analytics back to the site during the user's session.
- Adaptation of an analytical solution to the requirements of a specific business.
Following these criteria, we assembled our own analytics system, which we now use. The data is stored in raw form in ClickHouse. Another advantage of the custom system, by the way, is that there are no restrictions on the number of parameters, which immediately opened up the opportunity to collect more information. For example, we now know which colors, options, engines and other parameters are most popular in car configurators.
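A minimal sketch of what such raw storage might look like, driven from Python via clickhouse-driver. The table name, columns and the Map-typed parameters column are hypothetical and shown only to illustrate the "any number of parameters" idea; Map columns require a reasonably recent ClickHouse version:

```python
from datetime import datetime

from clickhouse_driver import Client

client = Client(host="localhost")

# Raw, unaggregated events; arbitrary extra parameters go into a Map column,
# so a new configurator option does not require a schema change.
client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_time  DateTime,
        user_id     String,
        cookie_id   String,
        event_name  String,
        page_url    String,
        params      Map(String, String)
    )
    ENGINE = MergeTree
    ORDER BY (event_time, user_id)
""")

client.execute(
    "INSERT INTO events VALUES",
    [{
        "event_time": datetime(2021, 6, 1, 12, 0),
        "user_id": "u-42",
        "cookie_id": "c-123",
        "event_name": "configurator_select",
        "page_url": "/configurator",
        "params": {"color": "red", "engine": "1.4 TSI"},
    }],
)
```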
What about visualization?
As a BI tool we first tried Metabase, but it did not work out: it handles caching strangely and is hard to customize. We settled on Apache Superset, and here is why:
- It is developing rapidly. Every month there are some new improvements and updates.
- You can build your own charts and visualizations, for example with ECharts, which offers a huge range of possibilities.
- Dashboards can be styled with CSS, so it is quite easy to match a dashboard to a brand's corporate identity.
- It is a good exploration tool: you can quickly open it, sketch a SQL query and get an answer.
- Reports can be sent to email and Slack.
- Access control out of the box: for every chart, dataset or anything else, you can grant access to specific users.
Several dozen dashboards are now in use, with different access levels for different tasks: dashboards for baseline traffic metrics, dashboards for product teams, and dashboards for other agencies.
After the system was adopted, non-standard questions began to be answered much faster: now it is just a set of SQL queries in SQL Lab in Superset. And, notably, you can also connect other databases that are only partially related to web analytics.
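For instance, a "non-standard question" such as which configurator colors were chosen most often, week by week, reduces to a single query against the raw events. A hedged sketch against the hypothetical events table from above, run from Python:

```python
from clickhouse_driver import Client

client = Client(host="localhost")

# Weekly popularity of configurator colors over the last year,
# taken straight from the raw events (hypothetical table and params).
rows = client.execute("""
    SELECT
        toStartOfWeek(event_time) AS week,
        params['color']           AS color,
        count()                   AS selections
    FROM events
    WHERE event_name = 'configurator_select'
      AND event_time >= now() - INTERVAL 1 YEAR
    GROUP BY week, color
    ORDER BY week, selections DESC
""")

for week, color, selections in rows:
    print(week, color, selections)
```

The same query can be pasted into SQL Lab in Superset and turned into a chart in a couple of minutes.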
Let's not forget about automation
To handle routine tasks we use Apache Airflow 2. This lets us load data and generate reports in a single tool and, ultimately, aggregate data from other sources and other agencies into the final reports.
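A minimal Airflow 2 sketch of such a routine (the DAG id, schedule and queries are hypothetical): extract yesterday's figures from ClickHouse, then assemble a simple report.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="@daily", start_date=datetime(2021, 1, 1), catchup=False)
def daily_traffic_report():
    @task
    def extract():
        # Pull yesterday's aggregate from ClickHouse (hypothetical table).
        from clickhouse_driver import Client

        client = Client(host="localhost")
        return client.execute(
            "SELECT count() FROM events WHERE toDate(event_time) = yesterday()"
        )

    @task
    def report(rows):
        # In the real pipeline this step would merge data from other sources
        # and agencies; here we simply log the figure.
        print(f"Events yesterday: {rows[0][0]}")

    report(extract())


daily_traffic_report_dag = daily_traffic_report()
```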
After some time we had an analytics system that allows us not only to create services that respond to events in real time, but also to send e-mails and push notifications, fire webhooks, or send signals directly to the browser.
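As a sketch of the "react during the session" idea, here is a tiny Python handler that watches incoming events and fires a webhook when a target event appears. The event name, payload shape and webhook URL are all assumptions, not the real ones:

```python
import json
import urllib.request


def handle_event(event: dict) -> None:
    """React to a single analytics event while the user's session is live."""
    if event.get("event_name") == "lead_form_submitted":
        payload = json.dumps({"user_id": event["user_id"], "kind": "lead"}).encode()
        req = urllib.request.Request(
            "https://example.com/hooks/new-lead",  # hypothetical webhook URL
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=5)


if __name__ == "__main__":
    # In production this would consume from a queue or the collector itself;
    # here we feed one hand-made event through the handler.
    handle_event({"event_name": "lead_form_submitted", "user_id": "u-42"})
```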
The custom system, among its other advantages, makes it easy to connect any third-party API.
Under the hood, the system has its own analytics frameworks for Python and TypeScript/JavaScript, which simplify and speed up development. There is also a launch and control system, an embedded development environment based on Theia IDE, and JupyterLab for quick research and prototyping. Finally, Grafana is used for visualization and simple dashboards.
As an example of the resulting system in action, take its use for calculating an A/B test of the AVN for Skoda (a minimal sketch of such a service is shown after the list). The custom analytics made it possible to:
- Get data from ClickHouse.
- Process the data in Python + ClickHouse.
- Store the calculated data in CSV.
- Run on a timer every 6 hours.
- Stop, rebuild or customize the service at any time, since it runs independently of other systems.
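A compact sketch of such a service, under the same assumptions as in the earlier snippets (hypothetical events table, an 'ab_variant' parameter and a local CSV path): query ClickHouse, compute per-variant conversion in Python, dump the result to CSV, and repeat every 6 hours.

```python
import csv
import time

from clickhouse_driver import Client


def recalculate_ab_test() -> None:
    client = Client(host="localhost")
    # Conversion per A/B variant: share of users who submitted a lead form.
    rows = client.execute("""
        SELECT
            params['ab_variant']                                      AS variant,
            uniqExact(user_id)                                        AS users,
            uniqExactIf(user_id, event_name = 'lead_form_submitted')  AS converted
        FROM events
        GROUP BY variant
    """)
    with open("ab_test_results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["variant", "users", "converted", "conversion_rate"])
        for variant, users, converted in rows:
            writer.writerow([variant, users, converted, round(converted / users, 4)])


if __name__ == "__main__":
    # The service runs on its own: just rerun the calculation every 6 hours.
    while True:
        recalculate_ab_test()
        time.sleep(6 * 60 * 60)
```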
It sounds a little complicated, but an example makes it clear. Previously, when a user submitted a request to buy a car, only the buyer's phone number reached the sales manager, so the specialist had to call the client back and ask again about everything that had already been filled in.
Now, along with the request, the manager receives summary information about the visitor: interest in a particular car model, configuration, and so on. The information is personalized, so the manager better understands what the person needs and what to offer.
In conclusion: everything worked out. The custom analytics system now works for the benefit of our clients without the risk of losing data in six months or a year. Analysis and subsequent visualization take a minimum of time, and if something needs to be added or removed, it can be done very quickly and at minimal cost.