T-shirts, money, two cakes: how we unlearned task estimation





Hello, Habr! My name is Artyom, and I am a team lead at Skyeng. My development team has a customer: a product manager, or simply Vanya. Vanya believes that our task estimation scheme is far from ideal. For example, an estimate of 2 days tells him nothing. He will see his task in production in a week, or in 10 days. Or later. Or sooner.



This happens not because we miss our deadlines, but because a traditional estimate really covers only the time a developer needs to write the code. But there is also testing and code review. OK, let's include all of that in the estimate. But on top of that:



  • tasks queue up before development and before testing,
  • there is rework (we are not without sin),
  • urgent tasks fly in,
  • when an implementation affects several services, we wait for reviews from related teams.


How do we learn to answer the question "When?" if predictability is out of the question?



How we came to doubt estimates



Our team, like many in the company, has a very useful meeting: the technical review (tech review for short). It takes a decent amount of time and effort, but it adds predictability: we work out the technical solution in advance and estimate the task at the same time.



Since we always work remotely, everything happens in JIRA: there is a board that visualizes the stages of work. A card leaves the Tech Review status and moves to Ready for Development once we have described and estimated everything. This is the moment we commit to completing the work.





"Ready for Development" has a WIP limit: it can hold no more than 8 tasks at a time. The reverse rule also applies: as soon as the number of tasks in the column drops, we schedule a new tech review.



Fact: we spend significant time on estimation. The tech review usually takes place twice a week; working through and estimating 4 tasks in detail can take 1.5-3 hours. And that's not all! Afterwards we may spend even more time figuring out why an estimate was exceeded.



However, neither estimation nor these post-mortems add value to our product. If anything, we waste time on them. And money. I had long doubted that these procedures were needed, and at some point I felt ready for a serious conversation with the product manager. We both acknowledged the problem.



"The shirt is dry and completely ..." not XS



We decided to experiment with estimation approaches. I suggested trying T-Shirt Sizing, a technique that uses T-shirt sizes as the unit of measurement. You find the smallest task you have ever had to do and call it XS. The remaining tasks are then estimated by how much larger they are than XS, and are assigned size S, M, L or XL accordingly.





What won us over was the chance to estimate by eye. The idea was simple: we would collect statistics on how long it takes to complete a task of each size, calculate the average, and use it to predict delivery dates.



The customer would forgive an error of a day or two, which meant no more post-mortems. And at tech review we would no longer waste time on interactive sessions and secret voting. Smooth sailing!



We worked this way for several months, collecting statistics. And only Vanya kept looking at us askance.



It turned out that an XS, just like an S, could take us 1 day one time and 10 days the next. An L could take 5 days or 15. The reason is that we pick up some tasks first, some second, and some fifth, so tasks of the same size spend different amounts of time in waiting statuses. So much for the average.
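Why the average stopped being useful can be shown with a small sketch. The lead times below are made up for illustration (they only echo the "1 or 10 days" and "5 or 15 days" pattern described above and are not the team's real data), and the names `lead_times` and `summarize` are hypothetical.

```python
from statistics import mean

# Hypothetical lead times in days per T-shirt size (illustrative, not real data).
lead_times = {
    "XS": [1, 2, 10, 1, 9],
    "L": [5, 15, 6, 14, 5],
}


def summarize(samples):
    """Return the mean together with the observed range of lead times."""
    return {"mean": mean(samples), "min": min(samples), "max": max(samples)}


summary = {size: summarize(days) for size, days in lead_times.items()}

for size, s in summary.items():
    print(f"{size}: mean={s['mean']:.1f} days, range={s['min']}-{s['max']} days")
```

With samples like these, the mean for XS says "about 4.6 days", yet real tasks land anywhere between 1 and 10 days, so the average alone cannot serve as a promise to the customer.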



In short, the spread here was far more than a couple of days, and for Vanya little had changed. We declared the experiment a failure, but the idea that tasks can be sorted into categories stuck in my head. I kept thinking in that direction.



“Everyone loves cakes. Layered ones!” Donkey from Shrek



So do I. Besides, a child's birthday is a great occasion! I went to my favorite site and started choosing:



  • it’s possible, but it’s not possible,
  • you can have it decorated or plain,
  • it can be 2 kg or 5 kg.


I won't reveal my taste preferences, but I picked a cake, and it was delivered on the appointed date. What follows is the philosophy of a team lead who has eaten too much cake.



Of course, I am not Newton, and the cake is not an apple, but the inspiration came.



I could choose from many options, but whatever I chose, the delivery date did not change. I needed a cake in a week, and the bakery was ready to provide exactly that service. The size of the cake, its weight and all the bells and whistles had little effect on the final result; in this case, none at all. It's not about the size, as they say. Then what is it about? The price.



For example, the bakery offered express orders: for an additional fee, they would have delivered the same fancy cake in just a couple of days rather than 5. My order, being more valuable than the others, would have jumped the queue. In effect, the bakery has two SLAs: one for regular orders and one for VIP orders. That gave me something to think about.



The SLA idea clicked because I had read about it in the Kanban Guide



From the point of view of the Kanban method, everything is a service. Even though we do not deliver cakes, and our product cannot be touched or eaten, development is a service too. And we, too, treat different tasks differently.



Recall our board:



The service consists of several stages (development, code review, testing), and the "Ready for Development" column is our commitment point to the customer.



We do some things at our usual pace, but when urgent tasks arrive, we drop everything. All that remains is to find out what SLAs we actually have, and then we can make an agreement with Vanya.



How to evaluate the SLA of your team: building a spectral diagram (it's simple)



To understand which classes of service we have and what their SLAs are, Kanban suggests building the following chart:



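  • On the X axis, Lead Time (LT): the time it takes to complete a task, measured from the commitment point to delivery.
  • On the Y axis, the number of tasks completed in LT1, LT2, LT3, and so on.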


We took the tasks closed over the past few months and got the following:





We closed 3 tasks in one day and 6 in two; most tasks took 5 days, and in some cases we struggled with a task for more than two weeks...
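A chart like this needs nothing more than a list of lead times. Below is a minimal sketch that counts tasks per lead time and prints a text histogram. Only the first two counts (3 tasks in one day, 6 in two) and the peak at 5 days come from the description above; the rest of the sample is invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def lead_time_histogram(lead_times_days):
    """Count how many tasks were completed for each lead time (in days)."""
    counts = Counter(lead_times_days)
    return dict(sorted(counts.items()))


def render(histogram):
    """Draw one text row per lead time: X is the lead time, bar length is the task count."""
    return "\n".join(
        f"{days:>3} d | {'#' * count} ({count})" for days, count in histogram.items()
    )


# Partly hypothetical sample: one list entry per closed task.
sample = [1] * 3 + [2] * 6 + [3] * 4 + [5] * 9 + [7] * 5 + [16] * 1

hist = lead_time_histogram(sample)
print(render(hist))
```

In practice the list would be exported from the tracker as the number of days each closed task spent between the commitment point and delivery.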



Now it's time to analyze. What are these tasks? Why did they end up here? Why do more tasks land on certain lead times than on others, and what are they? You can go as far as asking customers and assignees, and studying the comments on each task.



Here is what we dug up. This is our regular work.




The spread is quite large, but it is amenable to analysis.



Overall, the bulk of the tasks fell in the 7-14 day range, and a couple landed very far out: this tail contained several (but not all) of the tasks that involved PRs to other services. Tasks completed in 3-4 days are the exception rather than the rule.



So, with 75% probability, a regular task will be done within 10 days.



And with 90% probability, it will be done within 14 days. If the development affects other services of the company, you will have to wait a little longer: we need a code review from another team, and then another deployment.



Let's move on. We named the next class "Important".





For one reason or another, these tasks are taken into work earlier than others: either they bring more value or their cost of delay is higher.



Here, too, we can state an SLA: with 75% probability the task will reach production within 5 days, and with 90% probability within 7. Shall we continue?
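The "with 75% / 90% probability" figures are percentiles of the observed lead times. The article does not say which percentile method was used, so the sketch below assumes the nearest-rank method: the smallest observed lead time that covers at least p% of the tasks. The sample data is invented and was picked so that the result mirrors the 5 and 7 day figures; the function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Smallest observed value v such that at least p% of the values are <= v."""
    if not values:
        raise ValueError("need at least one lead time")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank in the sorted list
    return ordered[rank - 1]


def sla(values, percentiles=(75, 90)):
    """Map each percentile to the lead time (in days) that can be promised."""
    return {p: percentile_nearest_rank(values, p) for p in percentiles}


# Hypothetical lead times in days for one class of service.
important = [2, 3, 3, 4, 4, 4, 5, 5, 7, 8]

print(sla(important))
```

Nearest-rank always returns a lead time that was actually observed, which makes the promise easy to explain to a customer. Interpolating methods would give slightly different numbers on small samples.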



The tasks for which we drop everything and grind away until they are done are blockers.





In 100% of cases, these are either minor follow-up fixes that we missed when implementing the main feature, or bugs that affect vital functionality in production.



Even though we managed to resolve all such cases within 2 days, we will still quote the 90th percentile. First, you should never promise 100% to anyone :) Second, you need to leave room for variability: remember the regular work, where several tasks slipped to 20+ days because a dependency on other teams appeared.



Done! We can agree with Vanya on an SLA for every class of service:





We deliberately chose 90% for the deadlines: in effect, this is the customer's tolerance for missed SLAs. That is, if 1 task out of 10 misses its SLA, the customer is ready to forgive us.
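Whether the team stays within that tolerance is simple arithmetic: take the share of tasks delivered within the SLA and compare it with the agreed target. A minimal sketch follows, with hypothetical function names and an invented sample of ten regular tasks measured against a 14-day SLA.

```python
def sla_hit_rate(lead_times, sla_days):
    """Share of tasks delivered within the SLA, from 0.0 to 1.0."""
    if not lead_times:
        raise ValueError("need at least one lead time")
    met = sum(1 for days in lead_times if days <= sla_days)
    return met / len(lead_times)


def within_tolerance(lead_times, sla_days, target=0.90):
    """True if the share of tasks that met the SLA is at least the agreed target."""
    return sla_hit_rate(lead_times, sla_days) >= target


# Hypothetical lead times in days: 9 of the 10 tasks fit into 14 days.
regular = [6, 7, 8, 9, 10, 10, 11, 12, 14, 21]

print(sla_hit_rate(regular, sla_days=14))
print(within_tolerance(regular, sla_days=14))               # 90% target
print(within_tolerance(regular, sla_days=14, target=0.95))  # stricter customer
```

The same sample that satisfies a 90% agreement fails a 95% one, which is why the target has to be agreed with the customer before the SLA numbers are quoted.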



If your customer is less forgiving, it is better to quote the 95th percentile, for example.



Instead of a conclusion



- And what stops Vanya from filing only important tasks or blockers?

- Horizontal WIP limits.


We agreed to limit the number of tasks per class of service: no more than two blockers and no more than two important tasks can be in work at once. Your numbers may differ; this is a matter of agreement with the customer. You can't set such limits in JIRA without plugins, so a verbal agreement is definitely needed. Tools are tools, but you get nowhere without human interaction.
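Since the board cannot enforce these limits by itself, the rule is easy to state as a check. This is only a sketch of the agreement, not a JIRA integration: the class names and the `can_start` helper are hypothetical, and the limits of two come from the agreement described above.

```python
from collections import Counter

# Per-class WIP limits; classes missing from the mapping are treated as unlimited here.
CLASS_WIP_LIMITS = {"blocker": 2, "important": 2}


def can_start(in_progress_classes, new_class, limits=CLASS_WIP_LIMITS):
    """True if one more task of new_class fits under that class's WIP limit."""
    limit = limits.get(new_class)
    if limit is None:
        return True
    return Counter(in_progress_classes)[new_class] < limit


board = ["blocker", "important", "regular", "blocker"]

print(can_start(board, "blocker"))    # two blockers are already in work
print(can_start(board, "important"))  # only one important task is in work
```

A check like this could run against a periodic export of the board, but the verbal agreement with the customer remains the actual mechanism.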



Thank you for your attention, and happy planning!


