New functionality without bugs: the case of billing for a mobile operator





Hi, my name is Maksim Plavchenok. I work at Bercut, where I do integration testing. In September my team and I passed an important milestone: we received zero errors from integration testing for the release of a new version of billing for a mobile operator. It took us two years to get there, and today I want to tell you how we managed it.



Zero errors from integration tests here means testing of new functionality during business acceptance on the operator's side. A few words about how this testing works.



We release new versions of our billing software on schedule, 6 times a year. Release shipping dates are known in advance. At the time of this writing, we already have scheduled release dates for the entire next year.



This cadence comes from the mobile operators' race for time-to-market. The main principle: subscribers should regularly receive new billing features. Payments from a smartphone, keeping your number when switching operators, the ability to sell unused traffic - the updates vary.



"We may not know what exactly we will release in a year, but we know the exact date when the update will be released." Due to this, it is possible to maintain the desired rhythm of updates.



On the billing development side, about 70 people are involved in a release. These are 5-6 teams, each with its own specialization: analytics, development (several teams), functional testing, integration testing.



Yes, we have a waterfall in billing projects. But the current story is not about how we radically changed the development paradigm from waterfall to Agile or vice versa. Each development approach has its own advantages and is good under the right conditions; I would like to leave this discussion outside the scope of this article. Today I want to talk about evolutionary development: how we moved towards zero errors at release acceptance within the framework of the existing development approach.



Discomfort zone



At the start of this story, two years ago, the picture looked like this:



  • teams at the end of the development chain were overwhelmed: "it's time to hand off to the next team, and the previous one has only just started its part of the work";
  • the customer could find around 70 errors after our testing cycles.


The errors found in integration testing ranged from minor ("part of a message is displayed as dashes") to critical ("switching to another tariff does not work").



We decided to change this state of affairs and set a goal: zero errors in business acceptance of new functionality.



After a year, we were able to reduce the number of errors to 10-15, and by mid-2020 - to 2-3. And in September we managed to reach the target of zero.



We succeeded thanks to improvements in several areas: tools, expertise, documentation, work with the customer, and the team itself. The improvements varied in importance: knowing the customer's specifics and processes is important, moving to a new scale for estimating task complexity is optional, and working with team motivation is critical. But first things first.



Growth points



The main tool of integration testing is the test bench, where subscriber activity is emulated.
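To give a rough idea of what "emulating subscriber activity" means, here is a minimal sketch. The event types and field names are invented for illustration; our real bench generates and delivers such records through its own tooling.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical subscriber usage events; a real bench feeds such records into billing.
@dataclass
class UsageEvent:
    msisdn: str          # subscriber number
    kind: str            # "call" | "sms" | "data"
    amount: float        # minutes, messages or megabytes
    happened_at: datetime

def emulate_subscriber(msisdn: str, hours: int = 24) -> list[UsageEvent]:
    """Generate a day's worth of random activity for one subscriber."""
    now = datetime.now()
    events = []
    for _ in range(random.randint(5, 30)):
        kind = random.choice(["call", "sms", "data"])
        amount = {"call": random.uniform(0.5, 20), "sms": 1, "data": random.uniform(1, 500)}[kind]
        events.append(UsageEvent(msisdn, kind, round(amount, 2),
                                 now - timedelta(minutes=random.randint(0, hours * 60))))
    return events

if __name__ == "__main__":
    for event in emulate_subscriber("79001234567"):  # made-up number
        print(event)
```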



Shared test benches



Production dumps are loaded onto the test benches so that we can test in conditions as close as possible to production.



The catch is that the dumps on our benches and on the customer's benches could differ. The operator makes a dump and passes it to us; we test the new functionality, catch and fix bugs; then we hand the finished functionality to the customer, and colleagues on the other side start their own testing. By that time our dump and theirs could differ in freshness: we tested in July, the operator in August, for example.



The differences were not critical, but they existed, and because of them errors could show up in testing on the customer's side that we never saw on ours.



What we did: we agreed that the data schemas used for testing would be the same and that, in general, we would have a shared test bench.



The dump lag remains, but we set up infrastructure on which this lag is minimal. This reduced the number of errors caused by differences between the test and production environments.
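As an illustration of the kind of check this agreement enables, here is a sketch that compares table structures between two benches. It assumes PostgreSQL and the psycopg2 driver purely for the example; the connection strings are placeholders, not our actual infrastructure.

```python
import psycopg2  # assumption: the benches run PostgreSQL and psycopg2 is available

SCHEMA_QUERY = """
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position
"""

def load_schema(dsn: str) -> set:
    """Return the set of (table, column, type) triples for a bench database."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(SCHEMA_QUERY)
            return set(cur.fetchall())

def diff_schemas(our_dsn: str, customer_dsn: str) -> None:
    """Print schema elements present on one bench but missing on the other."""
    ours, theirs = load_schema(our_dsn), load_schema(customer_dsn)
    for missing in sorted(theirs - ours):
        print("missing on our bench:", missing)
    for extra in sorted(ours - theirs):
        print("missing on customer bench:", extra)

# diff_schemas("dbname=billing host=our-bench", "dbname=billing host=customer-bench")
```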



Checking the settings before testing



When we hand a new version of the software to the operator for testing on the customer's side, it has to be configured: set up the new functionality and, possibly, adjust the configuration of the old one.



We wrote documentation describing the required settings. But the manuals could convey information with distortions: documentation is written by people and read by people, and where people communicate, misunderstandings happen.



This is a peculiarity of our software: the configuration has high requirements for flexibility and availability. The settings are complex, and without additional communication it was not always possible to convey all the necessary information through documentation alone.



As a result, the configuration was not always done correctly, and errors surfaced during testing on the operator's side. On analysis it turned out that these were not software errors but configuration errors. Such mistakes waste valuable time.



What we did: we introduced a procedure for checking the configuration on the customer's side before testing on the operator's benches.



The procedure is as follows: the customer chooses the cases that we demonstrate on the configured bench; we run the tests; if there are errors, we promptly fix them, and if not, the check is passed.
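A minimal sketch of how such a pre-check could be scripted is below. The file names, the case format, and the idea of validating against an exported configuration are all hypothetical; the real check is performed by demonstrating the cases on the bench.

```python
import json

def load_json(path: str):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def check_case(case: dict, config: dict) -> bool:
    """Hypothetical check: the settings a case relies on must be present
    in the configuration exported from the bench."""
    return all(key in config for key in case["required_settings"])

def run_precheck(cases_path: str = "customer_cases.json",
                 config_path: str = "bench_config.json") -> None:
    config = load_json(config_path)   # configuration exported from the bench
    cases = load_json(cases_path)     # cases chosen by the customer
    failed = [c["name"] for c in cases if not check_case(c, config)]
    if failed:
        print("fix the configuration before acceptance:", ", ".join(failed))
    else:
        print("pre-check passed: the bench is configured as documented")

if __name__ == "__main__":
    run_precheck()
```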



This approach allowed us to reduce the number of errors associated with incorrect settings during integration testing.



Additional communication around the documentation



Checking the settings before testing, in addition to describing them in the manuals, is one example of additional communication around the documentation. There were others.



For example, we arranged for there always to be a specialist on our side whom the customer could approach with questions about the documentation and the system as a whole - something like a dedicated support line staffed by our highly qualified specialists.



Our technical writers organized workshops to train the customer's employees on the new functionality.



The process of handing over documentation became less discrete and more continuous: new information, clarifications, and recommendations could now be sent in portions after the "shipment" of the main manuals - as they appeared or on request.



All this made it possible to better inform the customer about the new functionality and thereby reduce the number of errors on integration tests.



Expertise on working with third-party systems



To develop billing we need to account for data traffic, and there are separate PCRF systems for that. Calls and SMS are counted in one database, traffic in another, and special software keeps all of this in sync.



At the same time, the PCRF systems are third-party proprietary software - a black box: we send data in and get something back, but we cannot see what happens inside, let alone change anything there.



This setup limited our ability to localize and fix traffic-related bugs.



What we did: we set up a separate internal PCRF knowledge base. Every incident, every configuration option, every insight - everything was recorded and shared within the team.



As a result, we became competent users of the PCRF system: we can configure it and we understand what it is supposed to do. This saves time on simple incidents; with complex cases, of course, we still turn to the system's developers for help.



More test benches



Another peculiarity of testing mobile operator billing is that user scenarios can stretch out over time. The complete scenario we want to test can take days or even weeks.



Waiting several days or weeks during the testing phase is not an option. In practice, to check such scenarios, we most often simply rewind time in the database.



To rewind time you need to close all sessions except your own. So we get a situation where, say, 20 testers compete for two test benches, and each of them wants to rewind time. That means a queue, and a queue means a risk that by the agreed shipment date we won't have had time to check everything properly.
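To make the time rewind more concrete: in essence, event timestamps (or the bench's business date) are shifted into the past so that a long scenario fires immediately. A toy sketch on SQLite, with an invented table name; on a real bench the shift is done by the billing tooling and requires exclusive access to the data, which is exactly why the other sessions have to be closed.

```python
import sqlite3

def rewind_time(db_path: str, days: int) -> None:
    """Shift subscriber events N days into the past so that a long scenario
    (a monthly fee, for example) fires without actually waiting a month.
    The table and column names are invented for the example."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "UPDATE subscriber_events SET happened_at = datetime(happened_at, ?)",
            (f"-{days} days",),
        )

# rewind_time("test_bench.db", days=30)  # the scenario 'ages' by a month instantly
```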



What we did: we set up a separate test bench for each tester.



This eliminated the errors that happened because "my turn on the bench came too late and I ran out of time".



Virtualization



Preparing a test bench is not a quick process: you need to connect to the operator's network, request access, and more. The complete procedure could take up to several weeks. Reducing bench preparation time was an important part of the push towards zero errors.



What we did: we introduced virtualization.



Cloning virtual machines with all the necessary settings and preinstalled software, and automating that process, helped cut bench preparation time to "within a day".
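As a rough sketch of the kind of automation meant here, assuming a libvirt-based host and the standard virt-clone utility (our actual tooling and VM names differ):

```python
import subprocess

def clone_bench_vm(base_vm: str, tester: str) -> str:
    """Clone a pre-configured 'golden' VM for one tester and start it.
    Assumes a libvirt host with the virt-clone and virsh utilities installed."""
    new_name = f"{base_vm}-{tester}"
    subprocess.run(
        ["virt-clone", "--original", base_vm, "--name", new_name, "--auto-clone"],
        check=True,
    )
    subprocess.run(["virsh", "start", new_name], check=True)
    return new_name

# for tester in ["alice", "bob"]:          # hypothetical tester names
#     clone_bench_vm("billing-bench-golden", tester)
```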



Planning



Errors in integration testing are also a result of miscalculations in release planning. We were slow to get going, and by the fixed release date not everything was ready.



What we did: we introduced interim deadlines for each stage of development. "If you know the end date, you know every intermediate one" - this principle helped us better control our pace on the way to the release.
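The backward-planning principle is easy to express in code. A small sketch with made-up stage names and durations:

```python
from datetime import date, timedelta

def interim_deadlines(release: date, stages: dict) -> dict:
    """Walk backwards from the fixed release date and assign each stage a deadline.
    `stages` maps stage name -> duration in days, in delivery order."""
    deadlines, cursor = {}, release
    for stage, days in reversed(list(stages.items())):
        deadlines[stage] = cursor          # this stage must finish by `cursor`
        cursor -= timedelta(days=days)     # the previous stage ends when this one starts
    return dict(reversed(list(deadlines.items())))

plan = interim_deadlines(
    date(2021, 2, 15),  # hypothetical release date
    {"analytics": 14, "development": 30, "functional testing": 14, "integration testing": 10},
)
for stage, deadline in plan.items():
    print(f"{stage}: finish by {deadline}")
```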



Support and release in parallel



At the beginning of our journey, the "debts" of the previous release conflicted with the next one: after acceptance, bug reports arrived from the customer's side, and everyone switched to fixing them.



Meanwhile, the release schedule did not move. As a result, when it was time to tackle the next release, we could still be working on the previous one.



We changed the situation by splitting the team into two groups: one to fix errors from acceptance and one to work on the new release according to schedule.



The division was loose: not necessarily half here and half there, and we could move people between groups as needed. From the outside it might look as if nothing had changed: here is a person from the team who, during the sprint, worked on both bugs and new features. But in practice, separating out the groups was an improvement from the "now we can breathe" category. The focus within each group and the parallel work between them helped us a lot.



Chronologically, this was one of the first growth points we formulated at a postmortem. And this brings my story to the main tool.



Main tool



The improvement that helped us the most is honest postmortems.



Some call it a retrospective, others a results review; in our team the word "postmortem" has stuck. All the improvements described in this article were born at postmortems.



The principle is simple: there was a release, so you get together and honestly discuss how it went. It sounds easy, but the implementation has pitfalls. After an "unsuccessful" release the mood in the team is "no time for talk, we need to do something". Some people may come to the postmortem and stay silent, and thus withhold potentially useful information.



Over the two years of moving towards zero errors, we developed a number of principles for how we conduct postmortems.



  • Assemble the complete picture


We invite a broad list of participants: developers, testers, analysts, managers, executives - everyone who wants to speak up. Organizationally it is not always possible to gather absolutely everyone; that's fine, it works that way too. The point is not to turn colleagues away with the wording "we sum up the results in our team, you sum up in yours". Work with the test benches, code, processes, communication - we try not to lose sight of any aspect.



  • Don't grab onto everything at once


OK, the postmortem produced 30 growth points. How many should we take into work? Maybe we can handle them all by next time? The "pick 2-3" format worked best for us: it keeps the focus, and the team's efforts are not spread thin. It is better to do less but finish it than to take on a lot and leave it half-done.



  • Don't overthink the format


There are many approaches to running postmortems: facilitation practices, techniques from design thinking and lateral thinking, methods from Goldratt and other respected experts. In our experience, common sense is enough to get started. We wrote down the problems, grouped them, chose a few clusters, set the rest aside (see the previous point), discussed, and fixed a plan. When there is a common goal, finding a common language is not that hard.



  • Put it into work


Perhaps the main principle on this list. No matter how promising and convincing the list of improvements from a postmortem is, if it never goes into work, it is all in vain. If we agreed on it, we do it. Yes, there are other urgent matters; but we also have a goal, and we want to get closer to it.



Postmortems can be quite painful: talking about failure, even constructively, is not easy. But the discomfort is worth fighting through. I am sure that without postmortems we would not have come up with and implemented everything that helped us reach zero errors in the release.



The most important tool



Postmortems help you find the means to reach the goal, but if you look closely, they themselves are a consequence of a higher-level principle.



The most important tool is team involvement.



There is an instrumental side to involvement. For instance:



  • if we work overtime, the manager is right there with the team, helping hands-on;
  • if you want to keep the team's spirits up by tracking progress, you can find visual metrics (for the number of errors that is not hard).


And so on in the same spirit.



Involvement also has a hard-to-formalize side: the ability to share your belief in success with the team. After all, my team and I didn't just open the brochure with the company's values, see "stronger together" there, and decide that, bingo, the solution was found. We had seen examples of challenging goals being achieved by joining forces. We had people on the team who believed in success and worked to pass that belief on to their colleagues. The rest is a matter of technique.



People are the most important thing.






There was much more on the road to zero bugs: work on improving the documentation, raising the speed and quality of responses to customer questions, and more. This time I tried to share just a few examples and talk about the basic principles.



The team and I still have a lot to do in the fight for release quality and time-to-market: make the zero-errors result regularly reproducible in integration testing, and automate regression testing.



How to achieve these goals remains to be seen. But one thing we know for sure right now: we will keep holding postmortems and putting the resulting growth points into work. And we will keep drawing on what an involved team can do.



I hope some of this will be helpful to you too.


