How we created Data-Office





Hi, I am Ildar Raimanov, head of the department at BARS Group responsible for developing the company's BI solutions. Drawing on our broad experience with data and our industry expertise, we decided to set up a competence center that can process large volumes of data and provide a service that turns them into knowledge for specific subject-area requests from customers.



Data-Office includes several components at once: a well-developed storage that contains both a "big data lake" and prepared data marts, processes for loading data from source systems, mechanisms for checking data quality, a team of methodologists who understand what the numbers mean in their industry context, and, of course, a set of software tools, chief among them the Alpha BI business intelligence platform developed by BARS Group.



To make the text easier to follow, I will try to explain the key terms highlighted in it in plain language.



Speaking in more detail about approaches and steps, within Data-Office we have defined the following sequence:



1. Analysis of the subject area. A dedicated team of methodologists describes the subject area and its main entities and prepares a logical data model for the main storage.



Who are the methodologists? Essentially, they are industry experts who understand the meaning of the data. If we are talking about finance, these may be accountants and financiers; if we are talking about medicine, they are doctors and other qualified medical workers. It is their understanding that makes it possible to build a logical data model: a set of entities to be analyzed, together with the relationships between them, i.e. how one entity relates to another.
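To illustrate, here is a minimal sketch (in Python, with hypothetical medical entities) of what a fragment of such a logical model might look like before any physical design decisions are made:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    attributes: list[str]

@dataclass
class Relationship:
    parent: str   # the "one" side
    child: str    # the "many" side
    meaning: str  # what the relation expresses

# Hypothetical fragment of a logical model for a medical subject area.
logical_model = {
    "entities": [
        Entity("Patient", ["patient_id", "birth_date", "sex"]),
        Entity("Visit", ["visit_id", "patient_id", "visit_date", "clinic_id"]),
        Entity("Diagnosis", ["diagnosis_id", "visit_id", "icd10_code"]),
    ],
    "relationships": [
        Relationship("Patient", "Visit", "a patient can have many visits"),
        Relationship("Visit", "Diagnosis", "a visit can result in several diagnoses"),
    ],
}
```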



2. Based on the logical data model, a normalized physical model is prepared, and data architects get involved. IT specialists are needed here, because they are the ones who translate the set of entities into tables and create the necessary foreign keys, attributes, and indexes, that is, they build the so-called physical model.
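As a rough illustration of this step, here is how the hypothetical entities from the sketch above could be turned into a normalized physical model with tables, foreign keys, and indexes; SQLAlchemy is used here purely for illustration, and the names and connection string are assumptions:

```python
from sqlalchemy import (
    Column, Date, ForeignKey, Index, Integer, MetaData, String, Table, create_engine,
)

metadata = MetaData()

# Entities from the logical model become normalized tables with keys and constraints.
patient = Table(
    "patient", metadata,
    Column("patient_id", Integer, primary_key=True),
    Column("birth_date", Date, nullable=False),
    Column("sex", String(1)),
)

visit = Table(
    "visit", metadata,
    Column("visit_id", Integer, primary_key=True),
    Column("patient_id", Integer, ForeignKey("patient.patient_id"), nullable=False),
    Column("visit_date", Date, nullable=False),
)

# An index on the foreign key keeps joins from visits back to patients cheap.
Index("ix_visit_patient_id", visit.c.patient_id)

# Hypothetical connection string; in our case the storage layers live in PostgreSQL.
engine = create_engine("postgresql+psycopg2://etl:secret@dwh-host/dwh")
metadata.create_all(engine)
```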



3. A data flow model is worked out, and the sources and integration options are established. A data flow model is a set of transferred data with explicit rules: from where and to where, under what conditions, and with what frequency.
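A data flow model does not have to be anything exotic; even a plain structured description like the hypothetical one below already captures the rules we care about:

```python
# Every flow records where data comes from, where it goes, under what condition
# and how often. All names below are hypothetical.
data_flows = [
    {
        "source": "his_prod.visits",              # source system table
        "target": "buffer.visits_raw",            # first ("raw") storage layer
        "condition": "updated_at > :last_load",   # take only changed rows
        "frequency": "every 30 minutes",
        "transport": "JDBC extract via the ETL scheduler",
    },
    {
        "source": "billing_prod.invoices",
        "target": "buffer.invoices_raw",
        "condition": "full snapshot",
        "frequency": "daily at 01:00",
        "transport": "CSV export + COPY",
    },
]
```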



4. As a rule, since we are talking about a large amount of data, the data from the sources initially lands "as is" in the data buffer, the first layer of "raw data". The goal here is twofold: to minimize loading time and to keep a set of primary data so that, if necessary, the analysis chain can be unwound back to the very first value.
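For illustration, a simplified sketch of such an "as is" load into the buffer might look like this; connection details and table names are assumptions, and in practice bulk loading (for example COPY) is used for large volumes:

```python
from datetime import datetime, timezone

import psycopg2

# Hypothetical connections: one to the source system, one to the warehouse buffer.
src = psycopg2.connect("host=his-prod dbname=his user=etl")
dwh = psycopg2.connect("host=dwh dbname=dwh user=etl")

loaded_at = datetime.now(timezone.utc)

with src.cursor() as read_cur, dwh.cursor() as write_cur:
    # No transformation: rows land in the buffer exactly as the source gave them,
    # plus technical columns so the analysis chain can be unwound to the first value.
    read_cur.execute("SELECT visit_id, patient_id, visit_date FROM visits")
    for visit_id, patient_id, visit_date in read_cur:
        write_cur.execute(
            "INSERT INTO buffer.visits_raw "
            "(visit_id, patient_id, visit_date, _loaded_at, _source) "
            "VALUES (%s, %s, %s, %s, %s)",
            (visit_id, patient_id, visit_date, loaded_at, "his_prod"),
        )

dwh.commit()
```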



5. The transformation of data from the buffer to the second layer, the normalized storage, is worked out, along with the frequency of updating and storing information in the buffer; the question of incremental updates is resolved at the same stage. Data quality issues, methods, and tools are also worked out. Data quality means that the information matches the required logical content; it starts with simple format-logical validation checks and ends with more complex methodological patterns.
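To give an idea of what a simple format-logical control can look like, here is a hedged sketch; the rules and table names are illustrative and not our actual rule set:

```python
import psycopg2

# Hypothetical checks: each query counts rows that violate a rule.
CHECKS = {
    "visit_date is not in the future":
        "SELECT count(*) FROM buffer.visits_raw WHERE visit_date > current_date",
    "every visit references a known patient":
        "SELECT count(*) FROM buffer.visits_raw v "
        "LEFT JOIN dds.patient p ON p.patient_id = v.patient_id "
        "WHERE p.patient_id IS NULL",
}

def run_quality_checks(conn) -> dict[str, int]:
    """Return the number of offending rows per check (0 means the check passed)."""
    results = {}
    with conn.cursor() as cur:
        for name, query in CHECKS.items():
            cur.execute(query)
            results[name] = cur.fetchone()[0]
    return results

if __name__ == "__main__":
    with psycopg2.connect("host=dwh dbname=dwh user=etl") as conn:
        for check, bad_rows in run_quality_checks(conn).items():
            status = "OK" if bad_rows == 0 else f"{bad_rows} bad rows"
            print(f"{check}: {status}")
```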



6. Methodologists analyze consumer cases and, on that basis, describe possible data marts, that is, specially prepared data sets that help answer specific questions.

The BI development team then builds the actual set of marts, which forms the analytical data warehouse, the third layer.
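As an illustration, building one such mart from the normalized layer could look roughly like this; the schema and column names are assumptions:

```python
import psycopg2

# A mart is a pre-aggregated table answering one class of questions,
# e.g. "how many visits and unique patients per clinic per month".
BUILD_MART = """
    CREATE TABLE IF NOT EXISTS dm.visits_by_clinic_month AS
    SELECT
        c.clinic_name,
        date_trunc('month', v.visit_date) AS month,
        count(*)                          AS visits,
        count(DISTINCT v.patient_id)      AS patients
    FROM dds.visit  v
    JOIN dds.clinic c ON c.clinic_id = v.clinic_id
    GROUP BY c.clinic_name, date_trunc('month', v.visit_date);
"""

with psycopg2.connect("host=dwh dbname=dwh user=etl") as conn:
    with conn.cursor() as cur:
        cur.execute(BUILD_MART)
```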



7. It should be noted that in parallel, work is underway on building the Data Glossary (with a detailed methodological description) and on constantly maintaining the link between the storage entities and that methodological description.
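Conceptually, the link between the glossary and the storage can be as simple as the hypothetical structure below: every term carries its methodological definition and the list of storage objects it is connected to:

```python
# Hypothetical glossary entries linked to the storage objects they describe.
glossary = {
    "Visit": {
        "definition": "A single contact of a patient with a medical organization.",
        "storage_objects": ["dds.visit", "dm.visits_by_clinic_month"],
        "owner": "methodology team",
    },
    "Patient": {
        "definition": "A person registered in a source medical system.",
        "storage_objects": ["dds.patient"],
        "owner": "methodology team",
    },
}
```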



8. The toolset used during the above process may differ depending on the application. Mainly the Alpha BI business intelligence platform is used: the storage layers are built on PostgreSQL, and ETL tasks are solved with the platform itself.



9. Direct work with the prepared marts also goes through Alpha BI. When a request for knowledge arrives, the team of methodologists first analyzes the task and maps it onto the existing logical model; then the BI developers, having received a subject-oriented specification, implement the necessary selections, OLAP cubes, dashboards, and reports on top of the marts. Sometimes a mart is transformed somewhat, or a new one is created if the situation requires it.



Speaking of tools and big data, we cannot fail to mention our experience over the last several years with the fashionable "Big Data" stack: Hadoop serves as the layer for storing a large amount of raw historical data.


From a technical point of view, Alpha BI interacts with Hadoop through a layer built on the massively parallel analytical DBMS Greenplum, using PXF (Platform Extension Framework).
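As an illustration of this layer, a PXF external table over Parquet files in HDFS lets Greenplum (and therefore Alpha BI) query raw history like a regular table; the paths, columns, and hosts below are assumptions:

```python
import psycopg2

# External table over Parquet files in HDFS, visible to Greenplum through PXF.
DDL = """
    CREATE EXTERNAL TABLE ext.visits_history (
        visit_id   int,
        patient_id int,
        visit_date date
    )
    LOCATION ('pxf://data/raw/visits?PROFILE=hdfs:parquet')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
"""

with psycopg2.connect("host=gp-master dbname=dwh user=etl") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # From here on, queries can join raw HDFS history with warehouse tables.
        cur.execute(
            "SELECT count(*) FROM ext.visits_history WHERE visit_date >= %s",
            ("2020-01-01",),
        )
        print(cur.fetchone()[0])
```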



Similarly, Greenplum enables online analysis and work with hot data, which, for example, is updated every 10 seconds. In the case of hot data, the interaction through Greenplum is built with the in-memory database Apache Ignite, also via PXF.



At the end of the day, data from the Ignite table is transferred to HDFS and removed from Ignite.
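A hedged sketch of that end-of-day offload, assuming the hot data is visible to Greenplum through a readable PXF external table and archived through a writable one, might look like this; all object names, hosts, and the use of the Ignite thin client are assumptions:

```python
import psycopg2
from pyignite import Client  # Apache Ignite thin client (assumes hot data is in Ignite SQL tables)

# One INSERT ... SELECT inside Greenplum moves the day's hot rows from the
# Ignite-backed external table into a writable PXF external table over HDFS.
OFFLOAD = """
    INSERT INTO ext.hot_visits_archive  -- writable PXF table (FORMATTER='pxfwritable_export')
    SELECT * FROM ext.hot_visits;       -- readable PXF table over the Ignite store
"""

with psycopg2.connect("host=gp-master dbname=dwh user=etl") as conn:
    with conn.cursor() as cur:
        cur.execute(OFFLOAD)

# Once the rows are safely in HDFS, clear them from the in-memory store.
ignite = Client()
ignite.connect("ignite-host", 10800)
ignite.sql("DELETE FROM hot_visits")
ignite.close()
```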



Summing up, I would like to stress once again: data should work and be useful. To extract as much knowledge from it as possible, attention should be paid to all of the aspects above: build the storage competently, define optimal data flows, understand the subject area behind the "numbers", and choose the right tool for the task.



At the same time, it is, of course, worth paying special attention to forming the Team and dividing it by types of tasks, with like-minded professionals working on each of them.



And then your data, with its millions and billions of rows and terabytes of storage, will really begin to work, provide knowledge, and therefore be useful!



I will be glad to answer your questions in the comments!)


