Can you write a deadlock on Camunda BPM? I can



Some time ago I wrote about the successful migration from IBM BPM to Camunda, and now our life is full of happiness and pleasant impressions. Camunda did not disappoint, and we continue to be friends with this BPM engine.



But, alas, Camunda can also spring unpleasant surprises that sometimes lead to far from obvious results. This article looks at one such case which, despite its apparent simplicity, turned out to be interesting and somewhat more complex than it seemed at first glance.



Practicing on cats



To describe the problem, consider a synthetic example. Let's say we decided to expand our client base and now need to serve cats, both male and female. Each potential customer should be checked and, perhaps, immediately offered something.



We will check the reliability of the candidate and the possible services that we can offer them. The reliability check and the search for possible services are not connected in any way, so these actions can be performed in parallel. On a BPMN diagram, it looks like this:





Diagram 1. Schematic process for serving furry customers



The diagram schematically shows the basic steps, fork and join gateways.





This icon represents a parallel gateway. Parallel Gateway is the simplest gateway for building a parallel-running part of a process.



There are two types of parallel gateways:



  • fork - creates a separate execution for each branch;
  • join - waits for all incoming executions.


An execution represents a 'path of execution' in a process instance (from the documentation). In other words, it is a thread of process execution.
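The fork/join pair can be pictured with a toy model. The sketch below is not Camunda code: the class name `ParallelJoin` and the flow names are made up for illustration. It only shows the idea that a parallel join lets the process continue once an execution has arrived on every incoming flow.

```python
class ParallelJoin:
    """Toy model of a parallel join (hypothetical, not the engine's class)."""

    def __init__(self, incoming_flows):
        self.incoming_flows = set(incoming_flows)
        self.arrived = set()

    def arrive(self, flow):
        """Register an execution arriving on `flow`; return True when the join fires."""
        if flow not in self.incoming_flows:
            raise ValueError(f"unknown incoming flow: {flow}")
        self.arrived.add(flow)
        # The join fires only when every incoming flow has delivered an execution.
        return self.arrived == self.incoming_flows


join = ParallelJoin(["reliability_check", "service_search"])
first = join.arrive("reliability_check")  # one branch is still running
second = join.arrive("service_search")    # all branches have arrived
```

Here `first` is falsy because one branch is still missing, and `second` becomes truthy once both branches have reached the join.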



Now let's complicate the task a little. We will check and search for services in the following way: first we check the state of the client, then we look at what services can suit him, and do some preprocessing. In addition, several services may be suitable for the client at once, so we should be able to offer all of them to the client.



Since we work with furry clients, the services will be appropriate: valerian, a scratching post, the owner's pillow and other useful things.





Diagram 2. Updated fluffy customer service diagram



The new version of the process is as follows. The reliability check and the search for possible offers run in parallel. The search itself is also parallelized, and only those branches whose conditions are met are executed.



For parallelization with conditions, the Inclusive Gateway is used, which is indicated by the following icon:





An Inclusive Gateway is a parallel gateway with conditions on its branches. Only the branches whose conditions are true are executed.



Like the Parallel Gateway, it comes in two types:



  • fork - for each branch whose condition is fulfilled, an execution is created and runs in parallel, just like an execution after a Parallel Gateway;
  • join - unlike the Parallel Gateway, waits for executions not from all branches, but only from those whose condition is true.
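As a rough illustration of this description (a conceptual model only, not how the engine is implemented), the fork can be seen as filtering branches by their conditions, and the join as waiting for exactly the branches that were started. The function names, variables and conditions below are hypothetical.

```python
def inclusive_fork(branches, variables):
    """Start one execution per branch whose condition evaluates to true."""
    return [name for name, condition in branches if condition(variables)]


def inclusive_join_done(started, arrived):
    """Idealised join: continue once every started branch has arrived."""
    return set(started) <= set(arrived)


# Hypothetical services and conditions for a furry client.
branches = [
    ("valerian", lambda v: v["stressed"]),
    ("scratching_post", lambda v: v["claws_mm"] > 5),
    ("pillow", lambda v: v["budget"] >= 100),
]
client = {"stressed": True, "claws_mm": 3, "budget": 150}

started = inclusive_fork(branches, client)
waiting = inclusive_join_done(started, ["valerian"])
finished = inclusive_join_done(started, ["valerian", "pillow"])
```

For this client only the `valerian` and `pillow` branches start, so the idealised join does not wait for `scratching_post`.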


It may happen that the checks performed are not enough and the client has to be checked again. To handle this, we add a condition after all the checks that can send the process back to the very beginning for a re-check:





Diagram 3. The final version of the process that should work



It turned out to be cumbersome, but the process solves the problem.



What? What happened?



Here strange things begin to happen. The reliability check branch completes and reaches the joining Parallel Gateway. So far, everything is going fine.



The second branch checks the client's state and, depending on the results, the corresponding tasks are performed. Next, the process stops at the joining Inclusive Gateway and does not move any further. If you look at Cockpit (Camunda's admin panel), you will see executions hanging on the joining Inclusive Gateway and on the joining Parallel Gateway.





Diagram 4. Hanging service process



That's it. We can say that we got a deadlock in a Camunda process. In this case, it is not directly related to deadlocks in the sense of parallel programming theory.



In search of a̶d̶v̶e̶n̶t̶u̶r̶e̶ an answer



Since I did not have a sufficient understanding of what happened and why the process stopped, the problem had to be solved empirically.



Perhaps you need a default branch for the Inclusive Gateway, and without it, the process cannot run normally?



It sounds strange, but let's try adding a default branch. Having a default branch is good practice anyway: without it, it may happen that no condition is met, and then we get an error.
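The role of the default branch can be sketched as follows. This is a simplified model with hypothetical names (`select_outgoing_flows`, `no_offer`), not engine code: the default flow is taken only when no condition holds, and without it the gateway has nowhere to go.

```python
def select_outgoing_flows(conditional_flows, variables, default_flow=None):
    """Pick the flows to follow; fall back to the default flow if none match."""
    taken = [name for name, condition in conditional_flows if condition(variables)]
    if taken:
        return taken
    if default_flow is not None:
        return [default_flow]
    # Assumption: the engine reports an error in this situation.
    raise RuntimeError("no outgoing flow could be selected")


flows = [("valerian", lambda v: v["stressed"])]
calm_cat = {"stressed": False}

with_default = select_outgoing_flows(flows, calm_cat, default_flow="no_offer")

try:
    select_outgoing_flows(flows, calm_cat)
    failed = False
except RuntimeError:
    failed = True
```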





Diagram 5. Service process with default branch



We launch it and get the same result: the process still hangs at the joining Inclusive Gateway.



What follows is trying out all sorts of parameters and reading the documentation, and it drags on for half a day. Then, on yet another attempt, the process unexpectedly passes the ill-fated gateway. The lower branch with the Inclusive Gateway worked once the upper branch with the client reliability check had been deleted during debugging. That is, when the process degenerated into just the lower branch with the Inclusive Gateway, it ran to completion.





Diagram 6. Degenerate process



It turns out that Parallel Gateway somehow influences the Inclusive Gateway. This is weird, illogical, and it shouldn't be.



How is this possible? It is probably worth rereading the theory of how the Parallel and Inclusive Gateways work. What needs to happen for the join gateway to gather everyone and let the process go on? On the Internet, they write that each joining Inclusive Gateway waits for as many executions to arrive as left the fork. Then one more question suddenly arises: how does this counter work at all?



What are you? How do you work?



This problem is worthy of puzzle games and TV quiz shows. Only on quiz shows you are allowed to phone a friend. Then again, I can ask for help too. Let's call Denis, our business process architect.



- Denis, hi! Can you tell me how the joining gateway determines when it's time for the process to move on? Everywhere they write: "As many as went out, that many must come in." But how exactly does it count them?

- Very simple. Camunda counts the number of active executions.

- Thanks a lot. Bye!

Let's consider what happened. To do this, recall once more the scheme we ended up with:





Diagram 7. Hanging process with a default branch



For simplicity, let's consider the case when all conditions are met. What do we have at the moment when the three tasks following these conditions have completed?



How many active executions are there? Three on the lower branch and one on the upper, where we checked the reliability of the client. Camunda doesn't care that these are entirely different parallel branches. It is only interested in the number of active executions, of which there are four, while the joining Inclusive Gateway has received only three.



Fixing it



To rectify the situation, the joining gateway must collect all executions at once, and then, in theory, the process will move on. Let's try to leave one join gateway instead of two:





Diagram 8. Corrected version of the process



Alas, after the changes the process began to look, in my opinion, less obvious. But it worked as originally planned. At this point, the quest ended safely, I was able to push the changes and go home.



The fun is just beginning



When I sat down to write this article and came up with an example of a process on which I could describe this case, I was disappointed: the process worked as it should and there was no deadlock.



At first I assumed that the Camunda version in the example is higher than in the project, and in the new version this problem has already been fixed. But downgrading Camunda did nothing. By the way, in all examples version 7.8.0 is used - it is far from the most recent, but it does not matter in principle. The problem was also checked and reproduced on the latest version at the moment - 7.13.



Through trial and error, the problem was pinned down. The original toy example did not have a reverse branch, unlike the process I was developing at work.



It turns out that in the presence of a reverse branch, the problem is reproduced and we find ourselves in a kind of deadlock, but without a reverse branch, everything works as it should.



The case demanded understanding and analysis. To get there, I had to look at the Camunda BPM sources. Since the problem was with the Inclusive Gateway, it seemed logical to look for the answer in the class responsible for the behavior of this element: InclusiveGatewayActivityBehavior. After stepping through both versions of the process in the debugger a couple of times, I figured out how it works.



When in doubt, read the sources!



To avoid a dull retelling, the description of how InclusiveGateway works, based on the source code, will be schematic. The logic of interest to us is concentrated in the execute method, where the activatesGateway method matters most for this case. As I understand it, it checks whether the InclusiveGateway can be passed. The execute method is called for each execution (for each running branch). In our case, there are three such branches, which means that this method will be called three times.



Let's see how the activatesGateway method works. For better understanding, let's give names to all executable branches.





Diagram 9. Process diagram with executions



As I understand it, the logic of the method is as follows: the number of executions that have arrived at this gateway is compared with the number of arrows (sequence flows) entering it. This check covers the simplest situation, when all branches of the Inclusive Gateway are executed: the join simply waits until the number of arrived executions equals the number of incoming arrows. That is, in the simplest case, the execute method is called as many times as there are branches entering the joining gateway, and then the process continues.



In our case, this method is called three times, as the number of arrived executions grows from 1 to 3. On the last call, the two numbers being compared are 3 and 4, so the check fails and we follow the false branch.



If the condition is not met, the remaining executions are checked for whether they belong to the Inclusive Gateway, that is, whether the still-active executions are able to reach the Inclusive Gateway join.



A little patience is needed here: breathe out and keep reading. The denouement is close!



In the false branch of the activatesGateway method, every execution that has not yet arrived at the Inclusive join is checked for whether it can still reach this join. If at least one execution can reach the Inclusive Gateway, it has to be taken into account, and the join waits for it to arrive as well. If there are no executions that can reach the join, the method returns true.
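The decision described above can be paraphrased in a few lines. This is a sketch of the logic as retold in this article, not a copy of the engine source. The function name and parameters are hypothetical, as are the names of executions 2 to 4, and the incoming flow count of 4 assumes the variant with a default branch.

```python
def activates_gateway(incoming_flow_count, at_gateway, active, can_reach_gateway):
    """Return True if the join may fire, False if it must keep waiting."""
    # Simplest case: an execution has arrived for every incoming flow.
    if len(at_gateway) >= incoming_flow_count:
        return True
    # Otherwise look at the active executions that are somewhere else.
    elsewhere = [e for e in active if e not in at_gateway]
    # Wait if at least one of them could still arrive at this gateway.
    return not any(can_reach_gateway(e) for e in elsewhere)


at_join = ["execution 2", "execution 3", "execution 4"]
active = ["execution 1"] + at_join  # execution 1 sits at the parallel join

# Without a reverse branch, execution 1 can never reach the inclusive join.
no_loop = activates_gateway(4, at_join, active, lambda e: False)

# With a reverse branch, execution 1 is considered able to reach it.
with_loop = activates_gateway(4, at_join, active, lambda e: e == "execution 1")
```

In this model `no_loop` is true, so the join fires, while `with_loop` is false, so the join keeps waiting for an execution that is itself stuck at the parallel join.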



The most interesting part is coming. At first glance, the last execution (execution 1 in the diagram) cannot reach the Inclusive Gateway. But it is worth looking at the implementation of the canReachActivity method, which performs this check, and the reason for the element's behavior becomes clear.



If we discard all the details of the code, then inside this method the isReachable method is called recursively; step by step, it checks whether this execution can get to the joining InclusiveGateway. The reverse branch provides exactly such a possibility and, alas, it is taken into account, although it should not be, since we only go back after all the joins.
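The effect of the reverse branch on such a check can be reproduced with a plain depth-first search over the process graph. The sketch below follows the spirit of the recursive check described above, but it is not the engine's code. The node names are hypothetical and only loosely follow the diagrams.

```python
def build_flows(with_reverse_branch):
    """Hypothetical process graph: node -> list of nodes reachable in one step."""
    flows = {
        "parallel_fork": ["reliability_check", "state_check"],
        "reliability_check": ["parallel_join"],
        "state_check": ["inclusive_fork"],
        "inclusive_fork": ["offer_a", "offer_b", "offer_c"],
        "offer_a": ["inclusive_join"],
        "offer_b": ["inclusive_join"],
        "offer_c": ["inclusive_join"],
        "inclusive_join": ["parallel_join"],
        "parallel_join": ["recheck_needed"],
        "recheck_needed": ["end"],
    }
    if with_reverse_branch:
        # The re-check sends the process back to the very beginning.
        flows["recheck_needed"].append("parallel_fork")
    return flows


def is_reachable(flows, source, target, visited=None):
    """Recursive reachability check that ignores when a path would be taken."""
    if visited is None:
        visited = set()
    if source == target:
        return True
    if source in visited:
        return False
    visited.add(source)
    return any(is_reachable(flows, nxt, target, visited)
               for nxt in flows.get(source, ()))


looped = is_reachable(build_flows(True), "parallel_join", "inclusive_join")
straight = is_reachable(build_flows(False), "parallel_join", "inclusive_join")
```

With the reverse branch, a path exists from the parallel join back to the inclusive join, so `looped` is true. Without it, `straight` is false and the join would be free to fire.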



As a result, Inclusive Gateway is waiting for another execution, which will never come. Thus, we get a kind of deadlock. In principle, if we discard the conventions, we get a classic deadlock: join on Parallel waits for the branch with Inclusive to be executed, and, conversely, the branch with Inclusive waits for Parallel to execute.



The diagram below shows the approximate direction in which the reachability of the Inclusive Gateway join is checked for the execution that arrived at the Parallel Gateway join via the parallel branch.





Diagram 10. Possible path from Parallel Join to Inclusive Join



The diagram shows that the Inclusive Gateway join is indeed reachable from the Parallel Gateway join, and according to Camunda BPM's logic it does not matter that this join is already “a lap ahead”.



After finding out the reasons, the question involuntarily arose: is this a bug or a feature? In my opinion, this is a bug. Now I am collecting information and cases to send a report to the Camunda team.



It’s good that the problem is localized. But what about now?



Actually, now - the conclusions:



  1. Forewarned is forearmed. We must build our processes, taking into account this behavior of Camunda.
  2. , . parallel join.
  3. Inclusive Gateway , , executions .
  4. , . , Parallel Gateway .


Apparent simplicity and clarity are sometimes deceiving. This can only be fought by accumulating and sharing knowledge. Alas, at the time of solving this problem, I did not have a deep understanding of the logic of the Inclusive join, so I had to tinker. I gained this knowledge through trial and error, phoning a friend, and debugging the sources.



From all this follows an obvious and far from new conclusion that you need to understand how the tool you are using works. The better you understand, the fewer such problems will be.



The second conclusion is also quite obvious: you need to decompose not only the code, but also the processes.



Links that were useful in parsing this case and writing an article:



  1. Camunda BPM source code
  2. Description of Inclusive Gateway operation
  3. How Parallel Gateway Works


