Overview of modern data analytics tools




Let me say up front that there are many kinds of analysts, because almost anything can be analyzed: web analysts, classical data scientists, business analysts, financial analysts, and product, system and UX analysts. The likely reason for this diversity is that in large companies dozens or even hundreds of programmers and analysts may work on a single platform or product at the same time. Under such conditions, roles become narrowly specialized.



Each type of analyst uses its own set of tools. I will therefore focus strictly on data analysis itself, regardless of where the data comes from. That excludes web analytics, CRM, ERP, warehouse accounting, logistics and document management systems from this review.



1. Programming languages



We will not dwell on exotic or rarely used languages and will consider only the most popular ones. First and foremost, that means Python.



Python



Python is the primary tool of data scientists. It is dynamically typed (types are checked at run time rather than declared up front) and is designed for rapid prototyping and short scripts. People with a background in programming and computer science often criticize it because algorithms written in pure Python are far from optimal in performance and memory use.
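
The memory side of that criticism can be illustrated with a small sketch. It compares a plain Python list of floats with the standard library's array type, which stores raw 8-byte values. The variable names and the sample size are assumptions chosen for the illustration, and the exact byte counts depend on the interpreter (the comments assume CPython).

```python
import sys
from array import array

n = 100_000  # sample size chosen only for this illustration

# A plain list stores pointers to separate float objects.
values_list = [float(i) for i in range(n)]
# array("d") stores the same numbers as packed 8-byte doubles.
values_array = array("d", values_list)

# Size of the list container plus every float object it points to.
list_bytes = sys.getsizeof(values_list) + sum(sys.getsizeof(v) for v in values_list)
# Size of the array object, including its packed buffer (on CPython).
array_bytes = sys.getsizeof(values_array)

print(f"list:  {list_bytes} bytes")
print(f"array: {array_bytes} bytes")
```

Packed storage of this kind is one reason numerical libraries are used instead of pure Python containers for large data sets.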



Nevertheless, the language has many advantages. First, Python is taught almost everywhere, so it is relatively easy to find an analyst who knows it. Second, it has libraries for data processing and machine learning with user-friendly interfaces. For example, scikit-learn (sklearn) makes it easy to build pipelines that combine preprocessing and model training. The machine learning algorithms and their settings are encapsulated in classes and objects, which keeps the code very simple.
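
As a minimal sketch of what such a pipeline looks like, the example below chains a scaling step and a classifier with sklearn's Pipeline class. The iris dataset, the step names "scale" and "model", and the chosen parameters are assumptions made for the illustration, not something prescribed by the library or by this article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset bundled with scikit-learn (chosen only for illustration).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and the model live in one object.
pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),                   # preprocessing step
    ("model", LogisticRegression(max_iter=1000)),  # estimator step
])

# fit() learns the scaling on the training data and then trains the model.
pipeline.fit(X_train, y_train)

# score() applies the same scaling to the test data before predicting.
accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Because the scaler is fitted inside the pipeline, the test data never influences the preprocessing, which is one of the practical reasons to prefer a pipeline over separate manual steps.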



R



Until recently, Python's main competitor was the R language. Knowledge of R still appears in job descriptions, though rarely, at least in the "nice to have" section. Until mid-2018 I programmed in R myself. While trying to automate part of my machine learning work, I almost reinvented the wheel by building my own pipelines for data preparation and model training in R. A little later I learned that such pipelines had long existed in the sklearn library under the name Pipeline.



C++, C#



If the existing Python libraries are not enough and you need to implement a new high-performance algorithm, the compiled, statically typed C++ language, or the similar C#, is at your service.



MATLAB



The MATLAB language is built into the software package of the same name, an interactive environment for engineering calculations. Admittedly, the language is aimed more at technical problems than at financial or business analysis. I have had occasion to use MATLAB twice: for studying acoustic emission signals in structures and for processing human speech.



There are also machine learning libraries with APIs for other programming languages such as Java, JavaScript and Scala. I will not dwell on them, since this article has a slightly different purpose.



Bear with me a little: the following sections cover everything else.



2. AutoML and visual designers



The basic idea of AutoML is to dramatically simplify the researcher's task by collapsing several steps (exploring and preparing data, engineering features, choosing and comparing machine learning algorithms, and tuning hyperparameters) into a single step: selecting and configuring one big box called AutoML. The result of running AutoML is a pipeline that is already built, configured and trained. All that remains is to feed the "raw" data into the pipeline and wait for the forecasts at the output.
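
To make the idea concrete, here is a deliberately tiny imitation of that "one big box", written with plain sklearn rather than with any real AutoML product. The function name toy_automl, the two candidate models and their hyperparameter grids are all assumptions made for the illustration; real AutoML tools search a far larger space and also automate feature engineering.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def toy_automl(X, y):
    """Hypothetical helper: compare a few models and settings, return the best pipeline."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression()),  # placeholder, replaced during the search
    ])
    # Each dict swaps in one candidate algorithm together with its own hyperparameters.
    search_space = [
        {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1.0, 10.0]},
        {"model": [DecisionTreeClassifier(random_state=0)], "model__max_depth": [2, 4, 8]},
    ]
    # Cross-validated comparison of every algorithm/hyperparameter combination.
    search = GridSearchCV(pipeline, search_space, cv=5)
    search.fit(X, y)
    # best_estimator_ is already refitted on all the data passed in.
    return search.best_estimator_, search.best_score_


X, y = load_iris(return_X_y=True)
best_pipeline, best_score = toy_automl(X, y)

# "Raw" data goes in, forecasts come out.
predictions = best_pipeline.predict(X[:5])
print(type(best_pipeline.named_steps["model"]).__name__, round(best_score, 3))
```

From the caller's point of view, model selection and tuning are hidden behind a single call, which is the essence of the AutoML approach.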



The AutoML "box" takes the form of either a machine learning library or a web service to which data is uploaded.



If it is a library, it differs from sklearn in that the usual 20-30 lines of code shrink to about 5. A well-known example of such a library is H2O.



Another example is the MLBox library. You can find stories about it online describing how MLBox helped its users reach the top 5% in Kaggle competitions.



Now a few words about AutoML cloud services. All the major digital giants are rushing to present their own solutions, among them Google AutoML Tables, Azure Machine Learning (Microsoft) and SageMaker Autopilot (Amazon). These services should interest primarily companies that build analytical systems on cloud platforms. It is very convenient when the data infrastructure, computing resources and ready-made machine learning algorithms all come from the same provider: the integration is truly seamless.



In addition to the digital giants, smaller players are entering the AutoML market. For example, Bell Integrator is actively developing the neuton.ai platform.



This section should also mention machine learning systems that sit between direct programming in R or Python and fully packaged AutoML: the so-called workflow constructors. Two typical examples are Microsoft's Azure Machine Learning Designer and Sberbank's SberDS platform.



A constructor is a set of building blocks from which you can assemble an entire machine learning pipeline, including the final check of the model's quality. It is undoubtedly an elegant solution for people with a visual mindset who are comfortable representing machine learning and model testing as diagrams.



3. BI tools



Here I would like to review several BI solutions in the field of analytics: Power BI, Tableau, Qlik Sense, QlikView and Excel.



Power BI



Power BI is a set of analytics tools from Microsoft available as desktop applications and cloud services. There are also corporate solutions that run on a company's own closed IT infrastructure. Working in Power BI Desktop or the Power BI service requires no coding skills. Power BI can connect online to external data sources and can also load data in CSV format.



Power BI can solve machine learning problems using AutoML, so you do not have to write code, as you would in Python, to build a classification or regression model. Beyond the standard analysis of tabular data, its functionality includes sentiment analysis, key phrase extraction, language detection and image tagging.



Tableau



Tableau, like Power BI, is a whole family of online and desktop applications. They have a simple visual interface and support drag-and-drop. Attractive charts are built in just a few clicks. You can also analyze the data in tabular form and apply various filters to it.



Tableau can solve machine learning problems such as regression, time series forecasting and cluster analysis. Most importantly, Tableau can integrate with external scripts in R and Python, which makes it an easily extensible tool.



Qlik Sense and QlikView



Qlik Sense and QlikView differ in positioning and interface, but they are built on the same engine and use the same problem-solving algorithms. QlikView is an enterprise platform run by IT specialists, while Qlik Sense is a self-service tool that does not require help from technical support.



On first acquaintance, the "beauty" and ease of visualization are striking. This is the tool for building an eye-catching management dashboard. To my eye, zooming in and out of geographic maps and clusters on two-dimensional plots looks especially spectacular. It reminds me of movie scenes in which someone tries to make out a car's license plate or pick a person out of a crowd in a satellite photo.



Another interesting option is a mobile application for doing analysis on a smartphone. One can picture a top manager of a retail chain hurrying through the airport to catch a flight and receiving an unexpected message with a link to a dashboard.

Qlik Sense integrates with Python and, through it, with machine learning.



Excel



Forgive me, but I could not pass over Excel. Laugh if you like, but every tool is good in its own way. For example, Excel pivot tables and charts are built in just a few clicks. Combined with a convenient spreadsheet and CSV support, it is quite a good tool.
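
For readers who have not used pivot tables, the sketch below shows what one computes, using only Python's standard library. The sales data, the column names and the helper pivot_sum are all made up for the illustration; in Excel the same summary is produced with the mouse rather than with code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sales data in CSV form (invented for this example).
CSV_TEXT = """region,product,amount
North,Apples,120
North,Pears,80
South,Apples,200
South,Pears,50
North,Apples,30
"""


def pivot_sum(rows, index, columns, values):
    """Sum `values` for every (index, columns) pair, like a simple pivot table."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[index]][row[columns]] += float(row[values])
    # Convert nested defaultdicts into plain dicts for easier printing.
    return {key: dict(cells) for key, cells in table.items()}


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
pivot = pivot_sum(rows, index="region", columns="product", values="amount")

for region, cells in pivot.items():
    print(region, cells)
```

Each row of the result corresponds to a region and each column to a product, with the amounts summed in the cells.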



4. The icing on the cake: AI-based automatic code generation



Once, while chatting with someone online, I was asked, "Do you program in Python?" When I answered "Yes," what followed was completely unexpected.



"Do you know about this ..." followed by a link to a YouTube video:

https://www.youtube.com/watch?v=fZSFNUT6iY8&t=4s&ab_channel=FazilBabu



This is a generative text model from OpenAI, trained on code from GitHub. The examples shown demonstrate the model's ability to generate Python code from a function name and a brief description of what the function should do.
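
To show the shape of such a task, here is a hand-written illustration. The function name and the docstring play the role of the prompt; the body is the kind of code one would hope to get back. Both the function and its body are invented for this example and are not output captured from the model.

```python
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards,
    ignoring case and any non-alphanumeric characters."""
    # Everything below is a hand-written completion for illustration only;
    # it is not output produced by the model.
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]


print(is_palindrome("A man, a plan, a canal: Panama"))
```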



But what if such a model could be trained well on data scientists' scripts? That is food for thought ...


