To start, we will briefly describe what the process looked like before and after we taught our internal product A1 to programmatically build datasources and publish them to Tableau Server. Then we will take a closer look at the BI team's problem and the solution we found, and also look under the hood: creating a .hyper file, publishing it to Tableau Server and updating it. Details under the cut!
We in the DAN advertising group work a lot with advertising monitoring data from Mediascope, the industry-standard audience measurement company on the media market. The scenarios differ: some employees download raw data, others use ready-made preprocessed databases, and some order the development of automated dashboards based on this data. Let's talk about the last scenario in more detail: our BI developers build dashboards in Tableau, but before they start "drawing", the data also has to be brought into a format that is convenient for development.
The life path of data from raw reports to beautiful automated charts can be roughly divided into four steps:
- Getting raw data
- Data cleaning and revision
- Creating data sources for Tableau
- Development of visualizations
Before
Before we learned how to programmatically generate data sources for Tableau, the process looked like this:
1. Getting raw data
Users generate tabular reports through the internal tool A1. We will talk about it in more detail below.
2. Data cleaning and modification
Data transformation is also available in the A1 tool; after cleaning, the data can be exported to xlsx/csv and worked with outside the tool. It is worth noting that some users limit themselves to step 1 and, after exporting the reports, modify the data on their own.
3. Creating data sources for Tableau
Previously, dashboard customers came with a set of Excel files generated in the previous steps, and BI developers combined these Excels into a single datasource (Tableau slang for a data source) on their own. It was not always possible to get by with Tableau tools alone; they often had to write Python scripts.
4. Development of visualizations
Finally, the tip of the iceberg: creating a dashboard and publishing it on Tableau Server, where the customer can see it. In practice, building the visuals often takes less time than collecting the data and setting up updates.
The pain accumulated in the third step as the number of custom solutions grew; they were expensive to implement and maintain. Errors from the second step also leaked through regularly: the intermediate Excel between the two systems (A1 and Tableau) tempted users to "fix something by hand, no one will notice".
After
The main task was to eliminate the Excel files between steps 2 and 3. As a result, we taught A1 to build datasources and publish them to Tableau Server. Here is what we ended up with:
Now steps 1 through 3 take place in A1, and at the output the BI team receives a datasource published on Tableau Server, ready for developing visualizations. The Hyper API became the connecting link, which we discuss below.
Results
Reduced the number of handoffs between different tools. Now it is harder to make a mistake somewhere along the way, easier to pinpoint where an error occurred, and investigating failures takes less time. The system also warns users about common errors.
Freed up the BI team's time. Previously there were few template solutions and a lot of customization: most projects needed additional Python processing. In the rare cases where processing was not needed, we worked directly in Tableau Desktop (the main development tool).
Now preparing a datasource means clicking the required fields in the A1 interface, marking which of them to unpivot into rows (if necessary) and, optionally, defining the field types in advance.
We no longer load Tableau Server with updating bulky datasources: the update is done by A1, and the ready-made hyper is uploaded to the server.
Bonus: we encourage users to work inside A1. If previously some users exported raw reports and then modified them manually outside the tool, now that the whole process from steps 1 to 3 happens in A1, it is easier for them to set up the cleaning there as well.
Problem and solution
A little about A1
Before talking about our solution, we should say a few words about our internal product A1, to which we attached the datasource generation.
A1 is an internal product of the company designed to simplify the workflow for employees whose main job looks like this:
- Retrieve data from MediaScope software products
- Bring (clean) this data into a form convenient for subject analysts
- If necessary, prepare data for creating dashboards (we will talk about this today)
After the user finishes cleaning the data, it is stored in the A1 system. In our terminology this is called a "container". A container is a regular MongoDB document, which we need to deliver to Tableau Server.
The BI team problem
Our BI development team needed to somehow get the data out of A1, which was stored in MongoDB, and build dashboards on top of it. At first we tried to fetch the data from MongoDB using standard Tableau tools, but this did not solve the problem:
- Since the data is stored in MongoDB, Tableau receives data with an arbitrary structure on input, which means this logic would have to be maintained constantly.
- To aggregate the data, we needed to pull specific records out of a collection rather than the entire collection, and the Tableau driver cannot do that.
- Among other things, simply getting the data was not enough: sometimes it had to be "expanded", i.e. some columns unpivoted into rows, which was not easy to do at all.
What we came up with
We decided to solve this problem with our own homegrown solution based on the Tableau Hyper API library. This library allows you to create a file in the .hyper format, which is easy to fill with data and then use as a data source for a dashboard on Tableau Server.
Here is how the Tableau developers themselves describe Hyper:
Hyper is a high-performance in-memory data engine that helps customers quickly analyze large or complex datasets by efficiently evaluating database queries. As part of the Tableau platform, Hyper uses proprietary dynamic code generation techniques and advanced parallelism to achieve high performance when creating extracts and executing queries.
The approximate workflow in our application is as follows:
- User selects containers and desired columns
- The system pulls data out of containers
- Based on the received data, the system determines the types of columns
- The creation of the hyper is initialized and the data is inserted into it
- The hyper is uploaded to Tableau Server
- BI developers see the hyper on the server and create a dashboard based on it
When new data is loaded into the containers, the system receives a signal that the hyper needs to be updated:
- The system downloads the hyper from Tableau Server
- It takes the fresh data from MongoDB and updates the hyper
- Then it uploads the new hyper to the server, overwriting the existing one
- The user just needs to click on the "Refresh" button to display up-to-date information in the dashboard
What the user sees
As stated earlier, A1 is a web application. We used Vue.js and Vuetify to build the front end of the hyper generation service.
The application interface is divided into three screens.
On the first screen, the user selects the desired containers and columns.
If the "Unpivot" option is enabled, then two additional columns will be created in the hyper: variable - the names of the columns that are selected by the Metrics column and values - the values from these columns.
Selecting a Dimension column adds a column of the same name to the hyper. The number of selected Dimension columns and their names must match across all containers so that the integrity of the table in the hyper is not violated; this is what the "Hyper name" field is for: it lets you specify the name for a selected column if it is named differently in different containers.
This concludes the setup of the hyper. The user just needs to go to the second screen, click "Create hyper" and watch the progress in the logs.
The third screen contains additional settings:
- You can turn on ignoring updates if the hyper does not need to be updated automatically
- You can specify an email to send update reports
- You can manually specify the data type for the values column (used only in unpivot mode): float, string, or automatically determined by the system (we'll talk about types further)
- You can also specify data types for selected columns in containers.
What's under the hood
A1 is written in Python. We use pandas to work with the data, serialize the dataframes with pickle and store them in MongoDB GridFS.
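As an illustration of this storage scheme, here is a minimal sketch of saving a pickled dataframe to GridFS and reading it back (the connection string, database and file names are made up for the example):
import pickle

import gridfs
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["a1_example"]  # hypothetical database name
fs = gridfs.GridFS(db)

df = pd.DataFrame({"x": [1, 2, 3]})

# Serialize the dataframe with pickle and store the bytes in GridFS.
file_id = fs.put(pickle.dumps(df), filename="container_data.pkl")

# Later: read the bytes back and restore the dataframe.
restored = pickle.loads(fs.get(file_id).read())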
When a command to create a hyper is received, the system performs the following operations:
- Loads all the necessary containers from MongoDB and deserializes the data into pandas dataframes
- Prepares the data: keeps only the required columns in the dataframes, gives them new names and, if necessary, unpivots the tables via pandas.melt
- If the user has set data types for the columns, converts the data either to float32 or to string (a sketch of this preparation step follows the list)
- After all the preparatory work, the system creates the file via the Hyper API and sends it to Tableau Server via tabcmd.
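Below is a minimal sketch of the preparation step described above (the function name, its arguments and the exact column handling are illustrative assumptions; the real implementation lives inside A1):
import pandas as pd

def prepare_dataframe(df, keep_columns, rename_map, dims, metrics,
                      container_id, unpivot=False, value_type=None):
    # Keep only the columns selected by the user and give them their hyper names.
    df = df[keep_columns].rename(columns=rename_map)

    if unpivot:
        # Expand the metric columns into "variable"/"values" rows.
        df = df.melt(id_vars=dims, value_vars=metrics,
                     var_name="variable", value_name="values")
        # Only float and string types are used in the hyper.
        if value_type == "float":
            df["values"] = df["values"].astype("float32")
        elif value_type == "string":
            df["values"] = df["values"].astype(str)

    # Tag every row with its source container so the hyper can later be
    # updated one container at a time.
    df["container_id"] = container_id
    return df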
It is worth saying a little about column data types. One feature of storing data in A1 containers is that users do not worry about which types to assign to columns; pandas handles it for them and calmly copes with situations where both numbers and strings are present in one column. Hyper, however, does not like this: if you declare a column as int, it will raise an error when you try to insert anything other than an integer. Therefore we decided to use only two data types in hypers: string and float.
So, we have covered the general principle of operation; now let's talk about working with hyper itself.
Creating a .hyper file
To work with the Hyper API, you need to install the library; you can download it from the official website here. The site also has good examples of how to work with this tool. We will briefly cover the main points.
The hyper file itself is essentially a database, somewhat reminiscent of SQLite. Through the API you can access the data using SQL-like syntax:
f"SELECT {escape_name('Customer ID')} FROM {escape_name('Customer')}"
Since our system is written in Python, we use the Python version of the library. When creating the file, we must specify the schema name, the table name and the columns with their types. Both the schema and the table should be named "Extract", since this is the schema and table that Tableau Server reads when extracting data for workbooks.
from tableauhyperapi import (
    Connection, CreateMode, HyperProcess, Inserter,
    SqlType, TableDefinition, TableName, Telemetry, escape_name,
)

with HyperProcess(Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(
        hyper.endpoint, self.fullpath_hyper, CreateMode.CREATE_AND_REPLACE
    ) as connection:
        # Tableau Server expects the data in the "Extract"."Extract" table.
        connection.catalog.create_schema("Extract")
        main_table = TableName("Extract", "Extract")
        example_table = TableDefinition(main_table)
After defining the table, we need to add the columns and set their types. As we said earlier, our data has only two types (float or string), so we set the column types based on the dtypes in the dataframe:
for column in dataframe.columns:
    # String-like pandas dtypes become TEXT columns in the hyper.
    if dataframe[column].dtype.name in ("category", "object"):
        example_table.add_column(TableDefinition.Column(column, SqlType.text()))
    # Numeric columns were coerced to float32 earlier, store them as DOUBLE.
    elif dataframe[column].dtype.name == "float32":
        example_table.add_column(
            TableDefinition.Column(column, SqlType.double())
        )
connection.catalog.create_table(example_table)
After creating the table, you can insert data:
with Inserter(connection, example_table) as inserter:
    # Iterate over the dataframe row by row to avoid materializing
    # the whole table as a list of lists (see the note below).
    for val in dataframe.values:
        inserter.add_row(val.tolist())
    inserter.execute()
Here we run through the dataframe row by row and add the values via inserter.add_row(). In fact, the Hyper API has an add_rows() function, which takes a list of lists and inserts the values itself. Why didn't we use it? To save RAM: to get a list of lists of values out of the dataframe, you have to ask pandas for values.tolist(), and when you have 150 million rows of data this turns out to be a very expensive operation in terms of memory, while it does not noticeably affect performance (in any case, we did not see the speed of creating a hyper drop because of iterating over the rows). Besides, add_rows() works like syntactic sugar: it actually takes a list of lists and adds the data iteratively anyway.
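For comparison, here is a sketch of the two insertion approaches side by side (example_table and dataframe are the objects from the snippets above):
# Row-by-row insertion: only one row at a time is converted to a Python list.
with Inserter(connection, example_table) as inserter:
    for val in dataframe.values:
        inserter.add_row(val.tolist())
    inserter.execute()

# Bulk variant: dataframe.values.tolist() materializes the entire table
# as a list of lists in RAM before a single row is inserted.
with Inserter(connection, example_table) as inserter:
    inserter.add_rows(dataframe.values.tolist())
    inserter.execute()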
This concludes the creation of our hyper. Next, we need to publish it on the server.
Publishing a file to Tableau Server
To access Tableau Server we use the tabcmd utility: a console tool that lets you connect to the server and perform administrative tasks such as creating users, groups, workbooks, and so on.
We run the tabcmd command through Python's subprocess.Popen:
import subprocess

popen = subprocess.Popen(
    f'/opt/tableau/tabcmd/bin/tabcmd publish "{fullpath_hyper}" -n "{filename}" -o -r "A1_test" '
    '-s http://tableau.domain.com -u "username" -p "password" --no-certcheck',
    shell=True,
    stderr=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
return_code = popen.wait()
if return_code:
    error = str(popen.communicate()[1])
    return f"Failed to publish the hyper. {error}"
We pass the following command and flags to tabcmd:
- publish: upload a file to the server
- -n (--name): the file name to use on the server
- -o (--overwrite): if a file with this name already exists, overwrite it
- -r "A1_test" (--project): the project (folder) on the server into which the file is published
- -s (--server): the Tableau Server address
- -u, -p: the username and password for the connection
- --no-certcheck: do not validate the SSL certificate
Updating the hyper
We have figured out how to create a new hyper, but what do we do when a hyper is built from ten containers and one of them receives new data? We update the hyper.
When new data arrives in a container, the system checks whether there are any hypers that use this container. If there are, a task to update the hyper is created.
To understand which data comes from which container in the hyper, the system also adds a container_id column when creating the hyper. With this approach, updating becomes very simple:
- We take the file from the server
- We delete all rows in the hyper where container_id equals the id of the updated container
- We insert the new rows
- We upload the file back to the server with overwrite
The process of retrieving a file is slightly different from uploading it. We do not download the .hyper file itself from the server but a .tdsx archive, which we then unpack to get to the .hyper.
In order to pick up the file, we use tabcmd:
popen = subprocess.Popen(
    f'/opt/tableau/tabcmd/bin/tabcmd get "datasources/{filename_tdsx}" '
    f'-s http://tableau.domain.com -u "username" -p "password" '
    f'--no-certcheck -f "{fullpath_tdsx}"',
    shell=True,
    stderr=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
return_code = popen.wait()
if return_code:
    error = str(popen.communicate()[1])
    return f"Failed to download the datasource. {error}"
Here we use the following command and flags:
- get: download a file from the server. If the file on the server is test.hyper, you have to request test.tdsx, and all of these files live in the datasources directory (we could not find out why Tableau works this way; if you know, share in the comments)
- -f (--filename) : full path, including file name and extension, where to save the file
After the file has been downloaded, it must be unpacked with the zipfile module:
import zipfile

with zipfile.ZipFile(fullpath_tdsx, "r") as zip_ref:
    zip_ref.extractall(path)
After unzipping, the hyper will be in the ./Data/Extracts directory .
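To locate the extracted file programmatically, a small sketch like this can be used (path is the directory the archive was unpacked into, the same path as in the snippet above):
import glob
import os

# The hyper extracted from the .tdsx lives under Data/Extracts.
hyper_files = glob.glob(os.path.join(path, "Data", "Extracts", "*.hyper"))
fullpath_hyper = hyper_files[0]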
Now that we have the current version of the file, we can remove the unnecessary rows from it:
table_name = TableName("Extract", "Extract")
with HyperProcess(Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(hyper.endpoint, self.fullpath_hyper) as connection:
        # Delete every row that came from the container being updated.
        connection.execute_query(
            f"DELETE FROM {table_name} WHERE "
            f'{escape_name("container_id")}={container_id}'
        ).close()
Inserting data and publishing the file have already been described above.
Conclusion
What's the bottom line? By implementing the generation of hyper files and their automatic delivery to Tableau Server, we significantly reduced the load on the BI team, and updating the data in dashboards became simpler and, most importantly, faster. Getting to know the Hyper API was painless: the documentation is well written, and integrating the technology into our system went smoothly.
Thank you for your attention! If you have any questions or comments, please leave them in the comments.
The article was written jointly with Vasily Lavrov (VasilyFromOpenSpace).