Process Mining without PM4PY





Building a graph from process logs is easy. Analysts have a wide choice of professional tools for studying processes, such as Celonis, Disco, PM4PY and ProM. Finding deviations in the graphs and drawing correct conclusions from them is much harder.



What if a proven tool that you are interested in is unavailable for one reason or another, or you want more freedom in your graph calculations? How hard is it to write a miner yourself and implement the capabilities you need for working with graphs? We will do exactly that in practice using common Python libraries such as pandas, implement the calculations, and use them to answer detailed questions that process owners might ask.



A caveat up front: the solution in this article is not an industrial-grade implementation. It is an attempt to start working with logs on your own using simple code that is easy to follow and therefore easy to adapt. It should not be used on big data without significant rework, for example vectorized calculations or a different approach to collecting and aggregating event information.



Before drawing a graph, you need to compute it. This computation is the miner mentioned earlier. It collects information about the events (the vertices of the graph) and the connections between them and stores it, for example, in dictionaries. The dictionaries are filled by the calc procedure (code on github). The filled dictionaries are then passed as parameters to the graph-drawing procedure draw (see the code at the link above). This procedure formats the data as shown below:



digraph f {"Permit SUBMITTED by EMPLOYEE (6255)" -> "Permit APPROVED by ADMINISTRATION (4839)" [label=4829 color=black penwidth=4.723857205400346] 
"Permit SUBMITTED by EMPLOYEE (6255)" -> "Permit REJECTED by ADMINISTRATION (83)" [label=83 color=pink2 penwidth=2.9590780923760738] 
"Permit SUBMITTED by EMPLOYEE (6255)" -> "Permit REJECTED by EMPLOYEE (231)" [label=2 color=pink2 penwidth=1.3410299956639813] 
start [color=blue shape=diamond] 
end [color=blue shape=diamond]}


and passes it to the Graphviz graphics engine for rendering.
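To make the format more concrete, here is a minimal sketch of how such a DOT string could be assembled from node and edge counts. It is not the article's draw procedure: `to_dot` and its `offset` parameter are hypothetical names, the rule for edge colors is not known from the sample and is left out, and the pen width formula `log10(count) + 1.04` is only an observation that happens to match the three sample edges above.

```python
import math

def to_dot(node_counts, edge_counts, offset=1.04):
    """Format counts as a Graphviz DOT string (hypothetical helper, not the article's draw)."""

    def node(name):
        # Node names carry the event frequency in brackets, as in the sample.
        return f'"{name} ({node_counts[name]})"'

    lines = []
    for (src, dst), count in edge_counts.items():
        # log10 scaling keeps very frequent edges from becoming too thick.
        penwidth = math.log10(count) + offset
        lines.append(f'{node(src)} -> {node(dst)} [label={count} penwidth={penwidth}]')
    lines.append('start [color=blue shape=diamond]')
    lines.append('end [color=blue shape=diamond]')
    return 'digraph f {' + '\n'.join(lines) + '}'

# Counts copied from the sample output above.
node_counts = {
    'Permit SUBMITTED by EMPLOYEE': 6255,
    'Permit APPROVED by ADMINISTRATION': 4839,
    'Permit REJECTED by ADMINISTRATION': 83,
    'Permit REJECTED by EMPLOYEE': 231,
}
edge_counts = {
    ('Permit SUBMITTED by EMPLOYEE', 'Permit APPROVED by ADMINISTRATION'): 4829,
    ('Permit SUBMITTED by EMPLOYEE', 'Permit REJECTED by ADMINISTRATION'): 83,
    ('Permit SUBMITTED by EMPLOYEE', 'Permit REJECTED by EMPLOYEE'): 2,
}
dot = to_dot(node_counts, edge_counts)
print(dot)
```

If the graphviz Python package and the Graphviz binaries are installed, something like `graphviz.Source(dot, format='png').render('InternationalDeclarations_full')` is one way to render the string; whether the original code does it this way is an assumption.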



Let's start building and examining graphs with the implemented miner. Each example below repeats the same steps: read and sort the data, compute the graph, draw it. The event logs are the international declarations from the BPIC2020 competition. Link to the competition.



We read the data from the log and sort it by case, then by date and time. The .xes file was converted to .xlsx beforehand.



import pandas as pd

# Read the log and keep only the case id, event name and timestamp columns.
df_full = pd.read_excel('InternationalDeclarations.xlsx')
df_full = df_full[['id-trace','concept:name','time:timestamp']]
df_full.columns = ['case:concept:name', 'concept:name', 'time:timestamp']
# Sort the events chronologically within each case.
df_full['time:timestamp'] = pd.to_datetime(df_full['time:timestamp'])
df_full = df_full.sort_values(['case:concept:name','time:timestamp'], ascending=[True,True])
df_full = df_full.reset_index(drop=True)


Let's calculate the graph.



dict_tuple_full = calc(df_full)
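The real calc lives in the repository and is not reproduced here. As a rough illustration of the idea only, the sketch below counts event frequencies (vertices) and directly-follows pairs (connections) for a log that is already sorted by case and time. The name `count_graph`, the returned structure and the toy log are assumptions made for this example, not the author's implementation.

```python
from collections import Counter

import pandas as pd

def count_graph(df):
    """Return (node_counts, edge_counts) for a log sorted by case and timestamp."""
    node_counts = Counter(df['concept:name'])
    edge_counts = Counter()
    grouped = df.groupby('case:concept:name', sort=False)['concept:name']
    for _, events in grouped:
        # Wrap every trace in artificial start/end vertices.
        path = ['start'] + events.tolist() + ['end']
        # Count each pair of consecutive events as one connection.
        edge_counts.update(zip(path, path[1:]))
    return node_counts, edge_counts

# Toy log with two cases; the rows are already in chronological order.
toy = pd.DataFrame({
    'case:concept:name': ['c1', 'c1', 'c1', 'c2', 'c2'],
    'concept:name': ['A', 'B', 'C', 'A', 'C'],
})
nodes, edges = count_graph(toy)
print(nodes)
print(edges)
```

The plain Python loop over cases is easy to read but slow on large logs, which matches the caveat about big data made earlier.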


Let's draw the graph.



draw(dict_tuple_full,'InternationalDeclarations_full')


After completing the procedures, we get the process graph:







Since the resulting graph is not readable, we simplify it.



There are several approaches to improving the readability or simplifying the graph:



  1. use filtering by weights of vertices or links;
  2. get rid of noise;
  3. group events by name similarity.
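Approach 1 is not used further in the article, but it could look roughly like the sketch below, which drops connections whose weight is under a threshold. The function name `filter_edges`, the threshold value and the dictionary layout (a pair of event names mapped to a count) are assumptions for illustration; the counts are taken from the sample output shown earlier.

```python
def filter_edges(edge_counts, min_count):
    """Keep only the connections that occur at least min_count times."""
    return {edge: n for edge, n in edge_counts.items() if n >= min_count}

edge_counts = {
    ('Permit SUBMITTED by EMPLOYEE', 'Permit APPROVED by ADMINISTRATION'): 4829,
    ('Permit SUBMITTED by EMPLOYEE', 'Permit REJECTED by ADMINISTRATION'): 83,
    ('Permit SUBMITTED by EMPLOYEE', 'Permit REJECTED by EMPLOYEE'): 2,
}
kept = filter_edges(edge_counts, min_count=50)
print(kept)
```

A threshold like this hides rare connections, so it suits an overview of the main flow and is a poor fit for questions about deviations, which are often rare by nature.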


Let's take approach 3.



Let's create a dictionary for combining events:



_dict = {'Permit SUBMITTED by EMPLOYEE': 'Permit SUBMITTED',
 'Permit APPROVED by ADMINISTRATION': 'Permit APPROVED',
 'Permit APPROVED by BUDGET OWNER': 'Permit APPROVED',
 'Permit APPROVED by PRE_APPROVER': 'Permit APPROVED',
 'Permit APPROVED by SUPERVISOR': 'Permit APPROVED',
 'Permit FINAL_APPROVED by DIRECTOR': 'Permit FINAL_APPROVED',
 'Permit FINAL_APPROVED by SUPERVISOR': 'Permit FINAL_APPROVED',
 'Start trip': 'Start trip',
 'End trip': 'End trip',
 'Permit REJECTED by ADMINISTRATION': 'Permit REJECTED',
 'Permit REJECTED by BUDGET OWNER': 'Permit REJECTED',
 'Permit REJECTED by DIRECTOR': 'Permit REJECTED',
 'Permit REJECTED by EMPLOYEE': 'Permit REJECTED',
 'Permit REJECTED by MISSING': 'Permit REJECTED',
 'Permit REJECTED by PRE_APPROVER': 'Permit REJECTED',
 'Permit REJECTED by SUPERVISOR': 'Permit REJECTED',
 'Declaration SUBMITTED by EMPLOYEE': 'Declaration SUBMITTED',
 'Declaration SAVED by EMPLOYEE': 'Declaration SAVED',
 'Declaration APPROVED by ADMINISTRATION': 'Declaration APPROVED',
 'Declaration APPROVED by BUDGET OWNER': 'Declaration APPROVED',
 'Declaration APPROVED by PRE_APPROVER': 'Declaration APPROVED',
 'Declaration APPROVED by SUPERVISOR': 'Declaration APPROVED',
 'Declaration FINAL_APPROVED by DIRECTOR': 'Declaration FINAL_APPROVED',
 'Declaration FINAL_APPROVED by SUPERVISOR': 'Declaration FINAL_APPROVED',
 'Declaration REJECTED by ADMINISTRATION': 'Declaration REJECTED',
 'Declaration REJECTED by BUDGET OWNER': 'Declaration REJECTED',
 'Declaration REJECTED by DIRECTOR': 'Declaration REJECTED',
 'Declaration REJECTED by EMPLOYEE': 'Declaration REJECTED',
 'Declaration REJECTED by MISSING': 'Declaration REJECTED',
 'Declaration REJECTED by PRE_APPROVER': 'Declaration REJECTED',
 'Declaration REJECTED by SUPERVISOR': 'Declaration REJECTED',
 'Request Payment': 'Request Payment',
 'Payment Handled': 'Payment Handled',
 'Send Reminder': 'Send Reminder'}
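Every entry in the dictionary above follows one pattern: the group name is the event name with the trailing " by ROLE" part removed. Under that assumption the mapping could also be generated instead of typed by hand. `group_name` is a hypothetical helper, and it would give wrong results for any event name that contains " by " for another reason.

```python
def group_name(event_name):
    """Drop the ' by ROLE' suffix, if any, from an event name."""
    return event_name.split(' by ')[0]

# A few event names from the log, for illustration.
names = [
    'Permit SUBMITTED by EMPLOYEE',
    'Permit APPROVED by BUDGET OWNER',
    'Declaration FINAL_APPROVED by DIRECTOR',
    'Start trip',
    'Request Payment',
]
mapping = {name: group_name(name) for name in names}
print(mapping)
```

On the real log the list of names could come from `df_full['concept:name'].unique()`.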


Let's group the events and draw the process graph again.



df_full_gr = df_full.copy()
df_full_gr['concept:name'] = df_full_gr['concept:name'].map(_dict)
dict_tuple_full_gr = calc(df_full_gr)
draw(dict_tuple_full_gr,'InternationalDeclarations_full_gr')
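One thing to watch with this step: when a dictionary is passed to pandas `Series.map`, any value that is missing from the dictionary becomes NaN. The sketch below, on a made-up series with one unknown event name, shows how unmapped names could be detected and how the original name could be kept as a fallback.

```python
import pandas as pd

names = pd.Series(['Permit SUBMITTED by EMPLOYEE', 'Start trip', 'Some NEW event'])
mapping = {
    'Permit SUBMITTED by EMPLOYEE': 'Permit SUBMITTED',
    'Start trip': 'Start trip',
}

mapped = names.map(mapping)
# Event names that are absent from the dictionary end up as NaN.
unmapped = names[mapped.isna()].unique()
# Optional fallback: keep the original name where no group is defined.
safe = mapped.fillna(names)
print(list(unmapped))
print(safe.tolist())
```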




After grouping events by name similarity, the graph is more readable. Let's try to answer some questions. Link to the list of questions. For example, how many declarations were not preceded by a pre-approved permit?



To answer this question, we keep only the events of interest in the log and draw the process graph again.



df_full_gr_f = df_full_gr[df_full_gr['concept:name'].isin(['Permit SUBMITTED',
                                                            'Permit APPROVED',
                                                            'Permit FINAL_APPROVED',
                                                            'Declaration FINAL_APPROVED',
                                                            'Declaration APPROVED'])]
df_full_gr_f = df_full_gr_f.reset_index(drop=True)
dict_tuple_full_gr_f = calc(df_full_gr_f)
draw(dict_tuple_full_gr_f,'InternationalDeclarations_full_gr_isin')






With the resulting graph, the question is easy to answer: the connections labeled 116 and 312 correspond to declarations that were not preceded by a pre-approved permit.
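The numbers read from the graph could also be cross-checked directly in pandas. The sketch below uses a made-up three-case log and assumes that the connections in question are the ones leaving the start vertex, that is, cases whose first remaining event after filtering is a declaration event. That reading of the graph is an assumption, not something stated in the article.

```python
import pandas as pd

# Toy filtered log, already sorted by case and time.
toy = pd.DataFrame({
    'case:concept:name': ['c1', 'c1', 'c2', 'c3', 'c3'],
    'concept:name': [
        'Permit SUBMITTED', 'Declaration APPROVED',
        'Declaration APPROVED',
        'Declaration FINAL_APPROVED', 'Declaration APPROVED',
    ],
})

# First event of every case, then the number of cases per first event.
first_events = toy.groupby('case:concept:name', sort=False)['concept:name'].first()
starts = first_events.value_counts()
# Cases that begin with a declaration event, i.e. without a preceding permit.
no_permit = first_events[first_events.str.startswith('Declaration')]
print(starts)
print(no_permit.index.tolist())
```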



You can additionally drill down into connections 116 and 312 (filter by the 'case:concept:name' values that take part in the connection of interest) and make sure that the resulting graphs contain no permit-related events.
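The drill-down snippets in this article use a dictionary called d_case_start2, which is not defined here and presumably comes from the repository code. As an assumption about what it might contain, the sketch below builds a dictionary from the first event of a case to the list of case ids that start with it, and then checks one drill-down for permit events. `cases_by_first_event` is a hypothetical name and the log is made up.

```python
import pandas as pd

def cases_by_first_event(df):
    """Map each first event to the ids of the cases that start with it."""
    first = df.groupby('case:concept:name', sort=False)['concept:name'].first()
    result = {}
    for case_id, event in first.items():
        result.setdefault(event, []).append(case_id)
    return result

toy = pd.DataFrame({
    'case:concept:name': ['c1', 'c1', 'c2', 'c3', 'c3'],
    'concept:name': [
        'Permit SUBMITTED', 'Declaration APPROVED',
        'Declaration APPROVED',
        'Declaration FINAL_APPROVED', 'Declaration APPROVED',
    ],
})
start_cases = cases_by_first_event(toy)

# Drill down: keep only the cases that start with 'Declaration APPROVED'.
drill = toy[toy['case:concept:name'].isin(start_cases['Declaration APPROVED'])]
# The check passes if none of the remaining events is a permit event.
has_permit = bool(drill['concept:name'].str.startswith('Permit').any())
print(start_cases)
print(has_permit)
```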



Let's drill down into connection 116:



df_116 = df_full_gr_f[df_full_gr_f['case:concept:name'].isin(d_case_start2['Declaration FINAL_APPROVED'])]
df_116 = df_116.reset_index(drop=True)
dict_tuple_116 = calc(df_116)
draw(dict_tuple_116,'InternationalDeclarations_full_gr_isin_116')






Let's drill down into connection 312:



df_312 = df_full_gr_f[df_full_gr_f['case:concept:name'].isin(d_case_start2['Declaration APPROVED'])]
df_312 = df_312.reset_index(drop=True)
dict_tuple_312 = calc(df_312)
draw(dict_tuple_312,'InternationalDeclarations_full_gr_isin_312')






Since the resulting graphs contain no permit-related events, the answers 116 and 312 are confirmed.



As you can see, writing a miner and implementing the capabilities needed for working with graphs is not a difficult task. Python's built-in functions, together with Graphviz as the graphics engine, handled it successfully.


