Root cause analysis of incidents based on correlations between time series of IT infrastructure metrics

Introduction

One task of IT monitoring systems is to collect, store, and analyze metrics that characterize both the state of IT infrastructure elements (CPU load, free RAM, free disk space, etc.) and the state of business processes. To apply the extensive apparatus of statistical analysis, it is often convenient to represent these data as ordered time series of the corresponding variables. A good toolset for time series processing in Python is the combination of three packages: pandas, scipy, and statsmodels (pandas.pydata.org, scipy.stats, statsmodels.org). Together they provide a wide range of classes and functions for building time series, fitting statistical models, running statistical tests, and exploring data. Of everything these packages contain, this article describes only a few algorithms, in particular correlation analysis of time series of IT infrastructure metrics, which we use for root cause analysis in the monq AIOps platform.





, – , ( ) , – . – , , . , - , , (“correlation does not imply causation”).





, - , . - (): (root cause analysis) ( , β€œβ€ - ). , .





, - () , .. , ( 5 , ..). , ,   : , 5 , - . , , . 1, pandas - , resample('5min').mean 5- , fillna(method='ffill') ( ) :





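import pandas as pd

# Collection of regularized series, one per configuration unit (KE);
# initialize it once, before processing the first file
tsCollection=[]

# Parse the first column (Timestamp) as datetimes
data=pd.read_csv('TimeSeriesExample.txt',parse_dates=[0])

timeSeries=pd.Series(data['KEHealth'].values, index=data['Timestamp'])

# Average over 5-minute bins, then forward-fill the empty bins
# (ffill() is the current spelling of fillna(method='ffill'))
timeSeriesReg=timeSeries.resample('5min').mean().ffill()

tsCollection.append(timeSeriesReg)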
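To make the effect of this regularization step concrete, here is a minimal sketch on a made-up irregular series. The timestamps and values are illustrative assumptions, not data from the article.

```python
import pandas as pd

# Hypothetical irregular measurements: two points fall into the first
# 5-minute bin, none into the second, one into the third.
stamps = pd.to_datetime([
    "2021-01-01 00:00:30",
    "2021-01-01 00:03:10",
    "2021-01-01 00:11:45",
])
irregular = pd.Series([100.0, 80.0, 60.0], index=stamps)

# Average the points inside each 5-minute bin; empty bins become NaN ...
binned = irregular.resample("5min").mean()

# ... and are then filled with the last known value.
regular = binned.ffill()

print(regular)
```

After this step every series sits on the same 5-minute grid, which is what allows the series of different KEs to be placed side by side in one dataframe.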
      
      



1. .  





Monq β€œβ€ - . , . , 2.





2. β€œ ” .  





. pandas , (dataframe) corr(), , ( ):





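import matplotlib.pyplot as plt

# One column per KE, aligned on the common 5-minute time index
allKeDF=pd.concat(tsCollection, axis=1)

# Pairwise Pearson correlation coefficients between the columns
corrMatrix=allKeDF.corr()

pallet=plt.get_cmap('jet')

img=plt.imshow(corrMatrix, cmap=pallet, vmin=-1, vmax=1, aspect='auto')

plt.colorbar(img)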
      
      



3. 150 . 





3 150 , . , β€œ β€œ, . , - ( nan ). , - . , , , . : ,   r>0.7, 65 (0.29% ), r<-0.7 4 (0.02%). : , , . , , r>0.95.
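The counts of strongly correlated pairs quoted above (r>0.7 and r<-0.7) can be obtained from the correlation matrix along the following lines. This is only a sketch: the function name `count_strong_pairs` and the toy data are assumptions made for illustration, and it counts each pair once by looking only at the upper triangle of the matrix.

```python
import numpy as np
import pandas as pd


def count_strong_pairs(corr: pd.DataFrame, threshold: float = 0.7):
    """Count distinct pairs with r > threshold and with r < -threshold."""
    values = corr.to_numpy()
    # Strict upper triangle: every pair once, diagonal excluded.
    rows, cols = np.triu_indices_from(values, k=1)
    r = values[rows, cols]
    # Constant series produce NaN coefficients; drop them.
    r = r[~np.isnan(r)]
    n_pos = int((r > threshold).sum())
    n_neg = int((r < -threshold).sum())
    return n_pos, n_neg, int(r.size)


# Toy data (assumption): b follows a, c mirrors a, d is noise, e is constant.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=200),
    "c": -base + 0.1 * rng.normal(size=200),
    "d": rng.normal(size=200),
    "e": np.ones(200),
})

n_pos, n_neg, n_pairs = count_strong_pairs(demo.corr())
print(n_pos, n_neg, n_pairs)
```

Dividing `n_pos` and `n_neg` by `n_pairs` gives the shares of strongly correlated and strongly anti-correlated pairs.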





4. - 5- 10-. 





4 , , - 5- 10-. , , , 5, μ=0, σ=0.11. 5- 20- σ=0.16, , , . , , .





5. 5- 10- .





6. .





6, - 7. , , , . ( ) t- , t=|r|·√(n-2)/√(1-r²), t- t t(α,k) k=n-2, n - . n ( ) . 7 t- α=0.05 . t<t, . t>t, . t scipy:





import scipy.stats
# Two-sided critical t value for significance level alpha and ndf=n-2 degrees of freedom
tCrit=scipy.stats.t.ppf(1-alpha/2, ndf)
      
      



7. .





, -, : 1) - (root cause analysis) 2) , - . , - - (): , , , - , . , - - . , - , - ( ) , , - , . , - , : 3) . 





, - , , , , . - , , () .   -. 





, - ,   , . , - , .  , - .





monq , . , ( ), , . , , , , . 





- , , , ( r>0.7), 8. , , . 





8. , -38374, .





, -, - . - (, , ..) - . , : r>0.95





9 - , 3200 . 0.95 7470, 2310. 10, t- (c α=0.001 ). , t- , t- 3 . t- α=0.01 27. -, , , , , .





9. - . 





10. , . 





, , , () , , . - , (), . , Mdist=||1||-Mcorr , ||1|| - , Mcorr... In the scipy module, you can build a dendrogram from the correlation matrix in several lines:





import scipy.cluster.hierarchy as hac

# Note: linkage() treats each row of the square matrix as an observation vector
z = hac.linkage(1-corrMatrix, method='complete')

hac.dendrogram(z, color_threshold=3, leaf_rotation=90., labels=list(allKeDF.columns))

plt.title('       KE', fontsize=12)

plt.ylabel(' ',fontsize=10)

plt.xlabel('KE',fontsize=10)

plt.show()
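A possible variant, shown only as a sketch, passes the distances 1-r to `linkage` in condensed form, so that they are used as pairwise distances rather than as rows of observations, and then cuts the tree into flat groups with `fcluster`. The function name `correlation_clusters`, the cut-off `max_distance=0.3`, and the choice to treat NaN coefficients as "uncorrelated" are assumptions, not part of the original procedure.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import squareform


def correlation_clusters(corr: pd.DataFrame, max_distance: float = 0.3) -> pd.Series:
    """Group columns whose mutual distance 1-r stays below max_distance."""
    # Assumption: a NaN coefficient (constant series) is treated as r = 0.
    dist = 1.0 - corr.fillna(0.0).to_numpy()
    dist = np.clip((dist + dist.T) / 2.0, 0.0, 2.0)  # enforce symmetry and range
    np.fill_diagonal(dist, 0.0)
    # linkage() expects a condensed (upper-triangle) distance vector.
    condensed = squareform(dist, checks=False)
    z = hac.linkage(condensed, method="complete")
    labels = hac.fcluster(z, t=max_distance, criterion="distance")
    return pd.Series(labels, index=corr.columns, name="cluster")


# Toy data (assumption): two independent groups of two correlated series each.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = rng.normal(size=300)
demo = pd.DataFrame({
    "a": x,
    "b": x + 0.05 * rng.normal(size=300),
    "c": y,
    "d": y + 0.05 * rng.normal(size=300),
})

clusters = correlation_clusters(demo.corr())
print(clusters)
```

With complete linkage, every pair inside a resulting group has a distance of at most `max_distance`, which for the 0.3 cut-off used here means r of at least 0.7.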
      
      



Figure 11 shows a dendrogram obtained from the correlation matrix of the health-metric time series of the 150 configuration units from Figure 3. The hierarchical clustering algorithm highlights, in different colors, clusters of configuration units (KE) whose metrics behave in a correlated way; in effect, it divides the entire set of KEs into related groups (subsystems). In the absence of a PCM system, such a partition already reveals some of the system's structure and can be useful, for example, when searching for the root causes of incidents.





Figure 11. Dendrogram from the health metric time series correlation matrix for the 150 most volatile CUs in the system. 







