"Alfa-bank is as reliable as a tank,
and Gamma-bank is as reliable as a bank!"
Victor Pelevin, "Numbers"
When the phrase "banking system" comes up in conversation, one imagines a super-reliable system built on the most expensive hardware, clustered at every possible level and fenced off from the outside world by every conceivable and inconceivable means of protection. Such systems do exist. But ...
If you look at bank job postings for developers, you may well find Cassandra, MongoDB and other platforms among the required skills, none of which inspire thoughts of 100% availability. And DBMSs such as Oracle or Microsoft SQL Server are in some places installed on a cluster of expensive servers connected to the most reliable, high-performance storage arrays, and in other places on an ordinary virtual machine in a farm of plain commodity hardware.
The reasons are obvious - redundant solutions are expensive. But how to find a compromise between the cost of the platform and its reliability?
Once upon a time, when there were few information systems in the enterprise, the infrastructure for each system was a work of art. Over time, there were more systems, it became expensive to support several hundred different hardware and software configurations, and infrastructure departments came to standardization. For example, a set of RDBMS infrastructure solutions that applications can use might look like this:
- hi-end class servers with hi-end class disk arrays plus synchronous replication;
- midrange class servers with midrange disk arrays plus synchronous replication;
- midrange class servers with midrange disk arrays plus asynchronous replication;
- commodity servers with midrange disk arrays without replication.
Now how do you choose a configuration for a specific database belonging to a specific application?
You can make a list of "the most important applications that must work at all costs." This raises two problems:
- The hardware configuration for the remaining applications depends on the "weight" of the system owner. As a result, some electronic sick-leave service runs on the most expensive equipment because it is the pet project of the chief accountant, with whom no one wants to quarrel. This is an unjustified waste of money.
- Some applications may be left off the "most important" list simply because nobody thought of them. For example, everyone remembers bank card processing, but no one remembers the screening of clients against "black lists", which has to run on every operation. As a result, the failure of such a system comes as an unpleasant surprise and leads to serious problems.

There is a formal methodology that lets you make the right choice: protect what needs protecting without overpaying where there is no need to.
To begin with, a classification of applications according to the level of criticality is introduced. As a rule, there are four of these levels. They can be called, for example, like this:
- Platinum;
- Gold;
- Silver;
- Bronze.
Or like this:
- Mission critical;
- Business critical;
- Business operational;
- Office productivity.
These four tiers fit perfectly into 4 different infrastructure configurations. The only thing left to do is to classify each application as needed.
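As a minimal sketch, the pairing of tiers and configurations can be written down as a lookup table. The one-to-one order used here (highest class on the most redundant configuration) is an assumption read from the two lists above, and the names `Criticality`, `INFRASTRUCTURE` and `infrastructure_for` are hypothetical.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Four criticality classes, ordered from least to most critical."""
    OFFICE_PRODUCTIVITY = 1
    BUSINESS_OPERATIONAL = 2
    BUSINESS_CRITICAL = 3
    MISSION_CRITICAL = 4


# Assumed pairing: the more critical the class, the more redundant the platform.
INFRASTRUCTURE = {
    Criticality.MISSION_CRITICAL:
        "hi-end servers, hi-end disk arrays, synchronous replication",
    Criticality.BUSINESS_CRITICAL:
        "midrange servers, midrange disk arrays, synchronous replication",
    Criticality.BUSINESS_OPERATIONAL:
        "midrange servers, midrange disk arrays, asynchronous replication",
    Criticality.OFFICE_PRODUCTIVITY:
        "commodity servers, midrange disk arrays, no replication",
}


def infrastructure_for(level: Criticality) -> str:
    """Return the standard configuration assigned to a criticality class."""
    return INFRASTRUCTURE[level]
```

Using an ordered enum keeps later comparisons ("not lower than", "the maximum of") trivial.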
When evaluating, it is important to observe two rules:
- The system is evaluated by the business, not by the IT department that serves it. Criticality should not be determined by how time-consuming or labor-intensive the system is to maintain. The only criterion is the losses the business will incur from system downtime.
- The questions that make up the assessment should be worded so that the answers can be verified. Of course, the criteria still rest on expert judgment, but the expert can at least explain why they gave this particular assessment.
What determines the level of criticality?
- . , , , .
- SLA (service level agreement). , β , .
- . , , .
In world practice, something like this has developed:
This does not mean that the distribution of systems by class in your enterprise should be exactly the same. But in any case, if more than 15% of the systems in operation end up in the Mission critical class, this is a reason to think seriously.
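The 15% rule of thumb is easy to check automatically once every system has a class. The sketch below assumes classes are stored as the short codes "MC", "BC", "BO" and "OP"; the function names are hypothetical.

```python
from collections import Counter


def mission_critical_share(classes):
    """Fraction of systems classified as Mission critical ("MC")."""
    if not classes:
        return 0.0
    return Counter(classes)["MC"] / len(classes)


def needs_review(classes, threshold=0.15):
    """True when the Mission critical share exceeds the threshold (15% by default)."""
    return mission_critical_share(classes) > threshold


# Hypothetical portfolio: 2 of 10 systems are Mission critical.
portfolio = ["MC"] * 2 + ["BC"] * 3 + ["BO"] * 5
print(needs_review(portfolio))
```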
To the question "how much is this or that system needed", any owner will answer "very". Therefore, another question needs to be asked: what happens if the system stops? The criticality class of the system depends on the severity of the consequences of the system shutdown and the speed of their occurrence.
Let's take a look at several banking systems.
The settlement system provides (surprise!) settlements between clients that are legal entities. If a large corporate client is suddenly unable to make a payment to a counterparty, the bank will lose a very significant amount, so the settlement system undoubtedly falls into the highest class of criticality.
Let's take card processing. If a hundred or two customers cannot pay with a card, the bank's losses will be small, but such a massive denial of service is unacceptable in itself.
Now let's take a system that maintains deposits. If this system stops, the bank's losses will again be small, and denial of service will not be as massive as in the case of processing. But do we need a newspaper editorial with the headline "The Bank Refuses to Issue Deposits"? The question is rhetorical.
Finally, let's take the general ledger. If something suddenly happens to it, clients will not notice anything, since this system does not take part in customer service at all. But delay the submission of the balance sheet, and sanctions from the Central Bank will not be long in coming.
So, the negative consequences of system downtime can be divided into 4 classes:
- Economic (direct losses);
- Client (denial of service);
- Reputational (negative reactions in the media);
- Legal (from warnings and fines to lawsuits and license revocation).
For each class of consequences, severity criteria should be formulated and assigned scores from 0 to 4. For example, a table might look like this:
Consequence | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Economic | no | <0.1% of planned profit | 0.1% .. 0.5% of planned profit | 0.5% .. 1% of planned profit | >1% of planned profit |
Client | no | 1 client | >1% of clients | >5% of clients | >10% of clients |
Of course, all numbers are arbitrary, all calculation methods are based solely on expert judgment, and the scope for debate about what to consider as "regional media" and how to treat negative articles in popular blogs is truly unlimited. But in a large corporation there will certainly be both a legal department and a PR service who will readily express a competent opinion.
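To make the severity criteria verifiable, each row of such a table can be turned into a small scoring function. Here is a sketch for the Economic row only. The thresholds are the example values from the table, the treatment of boundary values (exactly 0.5% and exactly 1% both score 3) is an assumption, and `economic_score` is a hypothetical name.

```python
def economic_score(loss, planned_profit):
    """Score direct losses from 0 to 4 as a share of planned profit.

    Comparisons are written as multiplications so that integer
    amounts are compared exactly, without floating-point rounding.
    """
    if planned_profit <= 0:
        raise ValueError("planned_profit must be positive")
    if loss <= 0:
        return 0                      # no losses
    if loss * 1000 < planned_profit:  # below 0.1%
        return 1
    if loss * 200 < planned_profit:   # 0.1% up to, not including, 0.5%
        return 2
    if loss * 100 <= planned_profit:  # 0.5% up to and including 1%
        return 3
    return 4                          # above 1%


# A loss of 5,000 against a planned profit of 1,000,000 is 0.5%.
print(economic_score(5_000, 1_000_000))
```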
The next step is to choose the time intervals over which losses will be estimated, for example 1 hour, 4 hours, 8 hours and 24 hours. These intervals are arbitrary and have nothing to do with the SLAs of the systems being evaluated, although later on it would make sense to tie the standard SLAs to exactly these intervals.
Now the owner of each system fills in a matrix of 16 cells. The numbers in the table below are given as examples. The only thing that is fundamentally important is that the estimate of the consequences for a longer interval cannot be less than the estimate for a shorter interval.
Consequence | up to 1 hour | 1..4 hours | 4..8 hours | 8..24 hours |
---|---|---|---|---|
Economic | 1 | 1 | 3 | 3 |
Client | 1 | 2 | 2 | 3 |
Reputational | 0 | 0 | 1 | 2 |
Legal | 1 | 2 | 3 | 4 |
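The one hard rule for the matrix, that scores never decrease as the interval grows, can be checked mechanically before the matrix is accepted. This is a sketch: the dictionary layout and the name `validate_matrix` are assumptions, and the data is the example matrix above.

```python
INTERVALS = ("up to 1 hour", "1..4 hours", "4..8 hours", "8..24 hours")


def validate_matrix(matrix):
    """Return a list of problems found in a consequence matrix (empty if none)."""
    problems = []
    for consequence, scores in matrix.items():
        if len(scores) != len(INTERVALS):
            problems.append(f"{consequence}: expected {len(INTERVALS)} scores")
            continue
        if any(not 0 <= s <= 4 for s in scores):
            problems.append(f"{consequence}: scores must be between 0 and 4")
        # A longer outage cannot be less harmful than a shorter one.
        if any(later < earlier for earlier, later in zip(scores, scores[1:])):
            problems.append(f"{consequence}: scores decrease over time")
    return problems


example = {
    "Economic":     [1, 1, 3, 3],
    "Client":       [1, 2, 2, 3],
    "Reputational": [0, 0, 1, 2],
    "Legal":        [1, 2, 3, 4],
}
print(validate_matrix(example))
```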
There are three steps left to get the final score from this matrix.
Step one - for each time interval, select the maximum estimate:
Interval | up to 1 hour | 1..4 hours | 4..8 hours | 8..24 hours |
---|---|---|---|---|
MAXIMUM | 1 | 2 | 3 | 4 |
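Step one is a column-wise maximum. A short sketch, assuming the matrix is held as a dictionary of rows (the name `max_per_interval` is hypothetical):

```python
def max_per_interval(matrix):
    """For each time interval (column), take the highest score across all consequences."""
    return [max(column) for column in zip(*matrix.values())]


matrix = {
    "Economic":     [1, 1, 3, 3],
    "Client":       [1, 2, 2, 3],
    "Reputational": [0, 0, 1, 2],
    "Legal":        [1, 2, 3, 4],
}
print(max_per_interval(matrix))  # [1, 2, 3, 4]
```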
Step two: we translate the obtained estimates into the criticality classes using the matrix:
Points | up to 1 hour | 1..4 hours | 4..8 hours | 8..24 hours |
---|---|---|---|---|
4 | MC | MC | BC | BC |
3 | MC | BC | BC | BO |
2 | BO | BO | BO | OP |
1 | BO | BO | OP | OP |
For this system, we obtain the following estimates:
Interval | up to 1 hour | 1..4 hours | 4..8 hours | 8..24 hours |
---|---|---|---|---|
CLASS | BO | BO | BC | BC |
And, finally, from all the obtained estimates, we choose the maximum one - in this case, the system being assessed should be classified as Business critical.
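Steps two and three can be sketched as a table lookup followed by a maximum over class ranks. The lookup table is copied from the matrix above. Mapping a score of 0 to OP is an assumption, since the matrix has no row for 0 points, and the function names are hypothetical.

```python
# Rows: points; columns: up to 1 hour, 1..4 hours, 4..8 hours, 8..24 hours.
CLASS_BY_POINTS = {
    4: ("MC", "MC", "BC", "BC"),
    3: ("MC", "BC", "BC", "BO"),
    2: ("BO", "BO", "BO", "OP"),
    1: ("BO", "BO", "OP", "OP"),
}

# Order of the classes, lowest to highest.
RANK = {"OP": 0, "BO": 1, "BC": 2, "MC": 3}


def classes_per_interval(max_points):
    """Step two: translate the per-interval maximum scores into classes."""
    return [
        CLASS_BY_POINTS[points][i] if points > 0 else "OP"  # assumption: 0 -> OP
        for i, points in enumerate(max_points)
    ]


def final_class(max_points):
    """Step three: the system gets the highest class found in any interval."""
    return max(classes_per_interval(max_points), key=RANK.__getitem__)


print(classes_per_interval([1, 2, 3, 4]))  # ['BO', 'BO', 'BC', 'BC']
print(final_class([1, 2, 3, 4]))           # BC
```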
Having received these estimates, we can reasonably choose one or another infrastructure solution for each system.
There are several nuances left, without which the described methodology would be incomplete.
If a system keeps another system running, its criticality class cannot be lower than the class of the system that depends on it. For example, Active Directory has nothing to do with the business as such. But if it suddenly goes down, the consequences for many business applications will be dire, and therefore AD definitely belongs to the Mission critical class.
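The dependency rule amounts to pushing classes "down" the dependency graph until nothing changes. A minimal sketch, assuming every system has its own business assessment and a known list of systems it depends on; the system names and the function `propagate` are hypothetical.

```python
RANK = {"OP": 0, "BO": 1, "BC": 2, "MC": 3}


def propagate(own_class, depends_on):
    """Raise each provider to at least the class of every system that depends on it.

    own_class:  system -> class from its own business assessment
                (every system mentioned in depends_on must be listed here)
    depends_on: system -> list of systems it needs in order to operate
    """
    result = dict(own_class)
    changed = True
    # Repeat until stable; classes only ever go up, so the loop terminates
    # even when the dependency graph contains cycles.
    while changed:
        changed = False
        for system, providers in depends_on.items():
            for provider in providers:
                if RANK[result[provider]] < RANK[result[system]]:
                    result[provider] = result[system]
                    changed = True
    return result


own = {"settlements": "MC", "deposits": "BC", "active_directory": "OP", "dns": "OP"}
deps = {
    "settlements": ["active_directory"],
    "deposits": ["active_directory"],
    "active_directory": ["dns"],
}
print(propagate(own, deps))
```

In this example both `active_directory` and the hypothetical `dns` service end up as MC, because the settlement system transitively depends on them.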
Losses incurred as a result of system downtime cannot be lower than the losses caused by interrupting the business process that the system supports. In light of this rule, it is very interesting to evaluate the corporate e-mail system, because it may suddenly turn out that the exchange of critical information depends on it.
If several business units of a company use the same system and their assessments of it differ, the maximum assessment should be used. Moreover, even the assessment criteria may differ. For example, the severity of being unable to serve one client can vary greatly depending on what kind of client it is: an ordinary retail customer, a VIP, or a large corporate client.
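One way to apply the "maximum assessment" rule is to merge the matrices of all business units cell by cell before classifying. This is a sketch under that assumption (taking the maximum of the final classes would be another reading); `merge_assessments` and the sample data are hypothetical.

```python
def merge_assessments(matrices):
    """Combine several units' matrices by taking the highest score in each cell."""
    merged = {}
    for matrix in matrices:
        for consequence, scores in matrix.items():
            current = merged.get(consequence, [0] * len(scores))
            merged[consequence] = [max(a, b) for a, b in zip(current, scores)]
    return merged


retail = {"Economic": [0, 1, 1, 2], "Client": [1, 2, 2, 3]}
corporate = {"Economic": [2, 3, 3, 4], "Client": [1, 1, 2, 2]}
print(merge_assessments([retail, corporate]))
```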
Label your systems - and may your infrastructure be as reliable as it needs to be, but not more expensive than it can be!