Boxes, whiskers and violins





Data often needs to be compared. Suppose we have several data series from some area of human activity (industry, medicine, government, ...) and want to see how similar they are or, conversely, how some indicators stand out against the others. To keep things easy to follow, let's take data that is simple, universal and neutral: the height at the withers and the weight of several dog breeds according to the American Kennel Club. Average breed size data can be found here... Add the random.uniform function from the Python numpy library, convert inches to centimeters and pounds to kilograms, and we have a realistic-looking multi-breed dog size dataset to work with. In our example the breeds are Chihuahuas, Beagles, Rottweilers and English Setters.
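A minimal sketch of how such a dataset might be generated. The breed ranges below are hypothetical placeholders chosen for illustration, not the American Kennel Club figures, and the column names (breed, height_cm, weight_kg) and the helper make_dog_dataset are assumptions. The sketch uses the Generator method rng.uniform, the seeded counterpart of numpy's random.uniform.

```python
import numpy as np
import pandas as pd

INCH_TO_CM = 2.54
LB_TO_KG = 0.45359237

# Hypothetical (min, max) ranges for illustration only, NOT the AKC figures:
# height at the withers in inches, weight in pounds.
BREED_RANGES = {
    "Chihuahua": {"height_in": (5, 9), "weight_lb": (3, 6)},
    "Beagle": {"height_in": (13, 19), "weight_lb": (20, 30)},
    "Rottweiler": {"height_in": (21, 25), "weight_lb": (80, 130)},
    "English Setter": {"height_in": (23, 27), "weight_lb": (45, 80)},
}


def make_dog_dataset(n_per_breed=100, seed=42):
    """Draw uniform random sizes per breed and convert them to metric units."""
    rng = np.random.default_rng(seed)
    frames = []
    for breed, ranges in BREED_RANGES.items():
        height_in = rng.uniform(*ranges["height_in"], size=n_per_breed)
        weight_lb = rng.uniform(*ranges["weight_lb"], size=n_per_breed)
        frames.append(
            pd.DataFrame(
                {
                    "breed": breed,
                    "height_cm": height_in * INCH_TO_CM,
                    "weight_kg": weight_lb * LB_TO_KG,
                }
            )
        )
    return pd.concat(frames, ignore_index=True)


dogs = make_dog_dataset()
print(dogs.groupby("breed").size())
```

Fixing the seed keeps the dataset reproducible between runs, which is convenient when the same data feeds several charts.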







One statistic you can use to compare these four series of numbers is the median. It splits a data series into two parts: half of the values are below the median and the other half are above it. We find the medians by grouping on the breed column with the pandas library and applying the median function to the grouped data. Similarly, you could look at other statistics, such as the mean (mean) and the mode (mode).
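The grouping step can be sketched as follows. The tiny table and the column names breed and height_cm are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only; three dogs per breed.
df = pd.DataFrame(
    {
        "breed": ["Chihuahua"] * 3 + ["Beagle"] * 3,
        "height_cm": [15.0, 18.0, 21.0, 38.0, 41.0, 45.0],
    }
)

# Group by breed, then reduce each group to a single number.
medians = df.groupby("breed")["height_cm"].median()
means = df.groupby("breed")["height_cm"].mean()

print(medians)
print(means)
```

For continuous measurements such as height, the mode is rarely informative, because exact values seldom repeat.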



We see that half of the Chihuahuas in our sample have a height at the withers of no more than 18 cm. The Beagle is much taller, around 41 cm, and the largest are the Rottweiler and the English Setter, which differ only slightly in height: 58 and 63 cm.







Figure 2. Median withers height values for four dog breeds.

But the median alone is not enough for a comparative analysis. You can learn more from a box-and-whisker plot (box plot), drawn here with the Python seaborn plotting library. The line inside the box is the familiar median: its level in the right-hand chart (see Figure 3) coincides with the height of the corresponding bar on the left. The box plot also shows how the data is distributed within each series. The lower edge of the box is the first quartile (the value that 25% of the observations fall below), and the upper edge is the third quartile (the value that 75% fall below). The "whiskers" are the segments extending up and down from the box; they are built from the interquartile range and mark the upper and lower bounds of the bulk of the data, excluding outliers. Our data has no outliers (we did not come across any emaciated or giant dogs); if there were any, they would be drawn as individual markers beyond the whiskers.
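To make the elements of the box plot concrete, here is a small sketch that computes them by hand with numpy. It assumes the common convention of whiskers reaching the most extreme observations within 1.5 interquartile ranges of the box; the helper box_stats and the sample values are hypothetical.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Return the numbers a box plot draws, using the 1.5 * IQR whisker rule."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = x[(x >= low_fence) & (x <= high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme points that are still inside the fences.
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        # Anything beyond the fences is drawn as an individual outlier marker.
        "outliers": x[(x < low_fence) | (x > high_fence)],
    }


# Made-up heights in cm; the last value is deliberately far from the rest.
stats = box_stats([54, 56, 57, 58, 59, 60, 62, 80])
print(stats)
```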







Figure 3. Comparison of a bar chart and a box plot drawn for the same dataset.

The violinplot function from the same seaborn library gives even more insight into the structure of the data. Figure 4 below shows all three charts; the breeds appear in the same order each time, and each breed keeps its color.
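A minimal sketch of how three such panels might be drawn side by side. The height ranges are hypothetical, the column names breed and height_cm are assumptions, and colors are left at the library defaults rather than matched per breed.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Hypothetical height ranges in cm, for illustration only.
ranges = {
    "Chihuahua": (13, 23),
    "Beagle": (33, 48),
    "Rottweiler": (53, 63),
    "English Setter": (58, 68),
}
df = pd.DataFrame(
    [
        {"breed": breed, "height_cm": h}
        for breed, (lo, hi) in ranges.items()
        for h in rng.uniform(lo, hi, size=100)
    ]
)
order = list(ranges)  # keep the same breed order in every panel

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)

# Panel 1: one bar per breed, bar height = median.
medians = df.groupby("breed")["height_cm"].median().reindex(order)
axes[0].bar(medians.index, medians.values)
axes[0].set_title("Median")

# Panel 2: box plot of the same data.
sns.boxplot(data=df, x="breed", y="height_cm", order=order, ax=axes[1])
axes[1].set_title("Box plot")

# Panel 3: violin plot of the same data.
sns.violinplot(data=df, x="breed", y="height_cm", order=order, ax=axes[2])
axes[2].set_title("Violin plot")

fig.tight_layout()
```

Call fig.savefig(...) or plt.show() afterwards to view the result.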







Figure 4. Comparison of a bar chart, a box plot and a violin plot drawn for the same dataset.

For example, Rottweiler data is shown in green.



The similarities and differences between the box plot (box and whiskers) and the violin plot are shown in Figure 5. First, the similarities: (1) both plots show, in one form or another, the 0.25-quantile, the 0.5-quantile (median) and the 0.75-quantile; (2) both show the extreme values, which lie within one and a half interquartile ranges (IQR) of the lower and upper edges of the box. On the box plot these are the whiskers, beyond which the outliers lie.



The difference is that the violin plot also shows how the data is distributed inside that range: the outline of the “violin” is the distribution density rotated by 90 degrees. This gives us much more to work with when reading the chart. In addition to the quantiles and the span of 4 interquartile ranges (1.5 + 1 + 1.5), the violin plot shows whether the data is spread evenly or concentrated around several centers where values occur more often.
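The density outline can be illustrated with a hand-written Gaussian kernel density estimate. This is only a sketch: the fixed bandwidth of 0.8 cm and the two-cluster sample are assumptions made for the example, and plotting libraries normally choose the bandwidth automatically.

```python
import numpy as np


def gaussian_kde(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(sample) * bandwidth)


# Made-up heights in cm with two centres, around 55 and 62.
heights = np.concatenate([np.linspace(53, 57, 50), np.linspace(60, 64, 50)])

grid = np.linspace(45, 72, 541)
density = gaussian_kde(heights, grid, bandwidth=0.8)

# The density is high near each centre and dips in the gap between them;
# a violin plot mirrors this curve around a vertical axis.
for point in (55.0, 58.5, 62.0):
    print(point, np.interp(point, grid, density))
```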







Figure 5. How the elements of the two plots correspond: box plot and violin plot.

This is seen more clearly in the next chart (Figure 6), where the data for two groups of Rottweilers differs but is chosen so that the medians coincide (leftmost chart) and, moreover, the box plots coincide too (center). Only the violin plot (far right) shows that the structure of the data is in fact significantly different.
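One way to construct such a pair of samples is sketched below. The numbers are invented for the example: group_a is spread evenly, group_b is packed around two centers, and both are built so that the minimum, quartiles, median and maximum match exactly.

```python
import numpy as np


def five_number_summary(x):
    """Minimum, first quartile, median, third quartile, maximum."""
    return np.percentile(x, [0, 25, 50, 75, 100])


# Group A: 101 heights spread evenly between 54 and 62 cm.
group_a = np.linspace(54, 62, 101)

# Group B: 101 heights clustered near 56 and 60 cm, with the values at
# sorted positions 0, 25, 50, 75 and 100 pinned to 54, 56, 58, 60 and 62.
group_b = np.concatenate(
    [
        [54.0],
        np.linspace(55.5, 56.0, 24),
        [56.0],
        np.linspace(56.0, 56.5, 24),
        [58.0],
        np.linspace(59.5, 60.0, 24),
        [60.0],
        np.linspace(60.0, 60.5, 24),
        [62.0],
    ]
)

print(five_number_summary(group_a))
print(five_number_summary(group_b))

# Count how many dogs are within 1 cm of the shared median.
near_median_a = int(np.sum(np.abs(group_a - 58) < 1))
near_median_b = int(np.sum(np.abs(group_b - 58) < 1))
print(near_median_a, near_median_b)
```

Because the five summary numbers agree and neither group has points beyond 1.5 interquartile ranges from the box, the two box plots look the same, while the count near the median shows how differently the values are packed.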







Figure 6. An example where only the violin plot reveals the differences in the internal structure of the data.

Using K-Means clustering (cluster.KMeans) from the sklearn module, we can show the grouped data visually with a scatter plot drawn by the scatterplot function of the seaborn module. Here color separates the clusters found by the ML algorithm, and marker shape shows the breed each point originally belonged to. There was no need to reduce the dimensionality with PCA or any other method, because the data is two-dimensional to begin with.







Code for clustering and scatter plotting:
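A minimal sketch of such code, under stated assumptions: the height and weight ranges are hypothetical, the column names are invented for the example, and the KMeans settings (four clusters, n_init=10, fixed random_state) are illustrative choices rather than the settings used for the figure.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical (height cm, weight kg) ranges, for illustration only.
ranges = {
    "Chihuahua": ((13, 23), (1.5, 3.0)),
    "Beagle": ((33, 48), (9.0, 14.0)),
    "Rottweiler": ((53, 63), (36.0, 60.0)),
    "English Setter": ((58, 68), (20.0, 36.0)),
}
n = 100
df = pd.concat(
    [
        pd.DataFrame(
            {
                "breed": breed,
                "height_cm": rng.uniform(*h_range, size=n),
                "weight_kg": rng.uniform(*w_range, size=n),
            }
        )
        for breed, (h_range, w_range) in ranges.items()
    ],
    ignore_index=True,
)

# Cluster on the two numeric columns; the breed label is not shown to KMeans.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(df[["height_cm", "weight_kg"]])
df["cluster"] = labels.astype(str)  # strings make seaborn treat hue as categorical

# Color = cluster found by KMeans, marker shape = original breed.
fig, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(
    data=df, x="height_cm", y="weight_kg", hue="cluster", style="breed", ax=ax
)
ax.set_title("K-Means clusters vs. original breeds")
```

Points whose color disagrees with the majority color of their marker shape are the dogs that the algorithm assigned to a different group than their breed.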









Thus, using data on the height at the withers of several dog breeds, we got acquainted with some statistical characteristics of data series and the tools for visualizing them. A simple tool gives a clear metric but not a complete picture. More sophisticated tools give a deeper picture of the data, but they are harder to read because there is more information on the chart. It is therefore important to choose the tool for the specific task, balancing the completeness of information required against how easily it can be read from the chart.


