Over the past five years, I have worked for the Machine Learning (ML) Model Validation Office at a large bank and have seen many bottlenecks in model development and validation.
When I started this article, I intended to survey the main information systems of some abstract Bank X, since the work of data analysts is built on top of already established information systems, and it is there that ML algorithms for decision-making are trained and run. But once I began writing, I discovered that it is far more interesting to discuss a number of topics and subtasks that come up when building and validating the Bank's most basic models, that is, credit risk models.
Risk management and credit risk calculation can be considered the forefathers of data science in a bank, since credit risk management is the core banking prerogative. It is skillful risk management that allows banks to offer something of value to the credit and financial market. The idea that a bank simply pockets the interest margin between the rate on a loan and the rate on a deposit is fundamentally wrong, although I sometimes hear it from people unfamiliar with the inner workings of the banking business.
On the one hand, the bank assumes all the risk of a loan not being repaid; on the other hand, it guarantees the depositor the return of the invested funds. The alternative to a bank deposit is to lend your money directly to a borrower with no guarantee of getting it back. The bank is able to give such guarantees because it has a "safety cushion" in the form of its own capital and builds the losses from unrepaid loans into its financial planning from the start (it "forms reserves"). It also knows how to estimate the probability that a borrower will not repay the loan issued to them. Of course, no one can predict exactly whether a particular individual or company will repay its debt, but on average, across a large number of borrowers, the probability can be estimated.
The Bank will be financially stable only if the profit that it earns on the interest margin covers the losses from loan defaults and other related expenses of the Bank.
Well-established banking practice
Before moving on to discussing predictive models and data science tasks directly, let us dwell for a minute on the specifics of how a bank works with a client. A bank, and especially a large bank, is a well-organized system in which literally every step is prescribed. This also applies to interaction with borrowers.
In particular, the concept of "default" comes up constantly when dealing with borrowers. Default is a status assigned to a client when there is near-certainty that the client will not return the money to the bank, at least not in full. The rules and procedures by which clients are assigned default status are agreed at the level of a specially created working group, and those rules are then written into the internal regulatory documentation.
If a client is assigned a default status, it is usually said that "the client has defaulted." From the point of view of the Bank's processes, this means that certain procedures of interaction with the client will be launched: the borrower's bankruptcy may be initiated, the Bank may try to sell the pledged collateral, collect from guarantors, sell the debt to a collection agency, and so on.
It just so happened historically that the expected losses from non-repayment of loans are usually divided into three components:
EL = PD * EAD * LGD
where EL - expected loss: the loss the Bank expects to incur on the loan;
PD - probability of default: the probability that the borrower will be assigned a default status within the next year, starting from the assessment date;
EAD - exposure at default: everything the client owes the Bank on the date of "going into default", including the principal as well as interest, fines and fees;
LGD - loss given default: the share of that exposure which the Bank will never recover, that is, its net loss.
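To make the decomposition concrete, here is a tiny numeric sketch; all the figures are made up purely for illustration:

```python
# Hypothetical illustration of the EL = PD * EAD * LGD decomposition.
# All figures are invented, not taken from any real portfolio.

pd_12m = 0.04      # probability of default over the next 12 months
ead = 1_000_000    # exposure at default: principal plus interest, fines, fees
lgd = 0.45         # share of the exposure the Bank does not expect to recover

expected_loss = pd_12m * ead * lgd
print(f"Expected loss: {expected_loss:,.0f}")  # -> Expected loss: 18,000
```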
If I depart from textbook definitions and concepts anywhere, I apologize in advance: my goal is not to retell the textbooks correctly but to get at the essence of the existing problems, and for that it is sometimes necessary to reason in rough, back-of-the-envelope terms.
Let's now try to formulate a typical task for a data scientist. The first thing we need to be able to predict is the probability of default, PD. Everything seems simple here: we have a binary classification problem. Give us the data with the true class label and all the factors, and we will quickly put together a script with nested cross-validation and a full hyperparameter search, pick the model with the best Gini metric, and everything will be fine. But for some reason, in reality, this does not work.
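For clarity, the kind of script that paragraph has in mind might look roughly like this; the CSV file and the column names are hypothetical, and Gini here is the usual credit-scoring rescaling of ROC AUC:

```python
# A sketch of the "textbook" approach: plain binary classification with
# cross-validated hyperparameter selection and the Gini metric (2 * AUC - 1).
# The file name and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

loans = pd.read_csv("loans.csv")                       # historical loans
features = ["debt_to_income", "income", "months_on_book", "delinquencies_12m"]
X, y = loans[features], loans["default_flag"]          # 1 = default, 0 = "healthy"

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

gini = 2 * search.best_score_ - 1                      # credit-scoring convention
print(f"Best cross-validated Gini: {gini:.3f}")
```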
There is no true class label
In fact, we do not know the true class label (the target). In theory, the target is a binary variable equal to zero if the borrower is "healthy" and one if the borrower has been assigned the "default" status. But the problem is that the rules by which default is determined are invented by us. Change the rules, and the model no longer works even on the historical training data.
We don't know our client well
As the history of issued loans accumulates, there is a desire to build more complex models, and that requires additional information about clients. That is when it turns out that we did not need this information before, and accordingly no one collected it. As a result, the collected samples are full of gaps, which undermines the very idea of building a more "informed" model. And that is not all.
A large number of customers makes it tempting to split them into segments and to build "narrower" and, at the same time, more accurate models within each one. But the division into segments is also performed according to some rule, and that rule is based on the same customer data. And what do we have? Gaps in the data, which means we cannot always even determine which segment a particular client belongs to.
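As a rough sketch of how this problem usually surfaces, here is the kind of check that delivers the bad news; the dataframe and attribute names are hypothetical:

```python
# Before segmenting clients, check how often the attributes that define the
# segments are missing. The file name and column names are hypothetical.
import pandas as pd

clients = pd.read_csv("clients.csv")
segment_attributes = ["annual_revenue", "industry_code", "employee_count"]

# Share of clients with each segmenting attribute missing.
print(clients[segment_attributes].isna().mean())

# Clients missing at least one of these attributes cannot be reliably
# assigned to any segment at all.
unsegmentable = clients[segment_attributes].isna().any(axis=1).mean()
print(f"Share of clients we cannot segment: {unsegmentable:.1%}")
```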
The regulator requires making models interpretable
By "regulator" I mean the Central Bank, which requires models to be comprehensible: not only the forecast itself should be clear, but also the rules by which that forecast was made. To be fair, this requirement applies mostly to the so-called "regulatory" models. In order to ensure the stability of the banking system as a whole, the regulator constantly monitors banks against a number of key indicators, among which, for example, is capital adequacy to cover unforeseen losses during possible economic and financial crises.
What does the requirement for interpretability mean? In most cases, it means you will have to be content with models in the form of logistic regression or a decision tree. You can forget about neural networks, ensembles, stacking and other "modern" architectures.
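To show what interpretability buys in practice: with a shallow decision tree, the learned rules can be printed and read line by line by a validator or a regulator. A minimal sketch, with hypothetical data and column names:

```python
# A shallow decision tree whose rules can be printed and audited.
# The file name and column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

loans = pd.read_csv("loans.csv")
features = ["debt_to_income", "months_on_book", "delinquencies_12m"]

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=500)
tree.fit(loans[features], loans["default_flag"])

# Every prediction traces back to an explicit rule such as
# "debt_to_income > 0.45 and delinquencies_12m > 1 -> high PD".
print(export_text(tree, feature_names=features))
```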
Procrustean bed of established banking practice
The de facto industry standard requires the expected loss to be estimated as the product of three values: PD, EAD and LGD. This is accurate only when events develop according to one of two scenarios: the client either repays the loan or does not. In the first case, it is considered that there are no losses; in the second, it is assumed that a certain amount is at risk (EAD).
In practice, the payment behavior of customers is not limited to two simple options, and the border between those options is rather arbitrary. The borrower can go into default in a month, a year, or two, and then, after being assigned the "default" status, suddenly return to payments and repay the entire loan. Moreover, deviations from the payment schedule can involve both amounts and timing, ahead of schedule or behind it. The financial result for the Bank is different in each case.
I am not saying that it is impossible in principle to reduce all the variety of borrower behavior to the three-component calculation scheme. It all depends on the task: where do we want to apply the model later? If the goal is to assess credit risk for pools (groups) of borrowers, then all possible deviations are absorbed by various calibrations and weighted averages. But if our goal is to personalize the approach to issuing a loan, including personalized offers, then it becomes important to forecast the client's stream of payments, or the net present value of that stream.
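A toy sketch of that cash-flow view: instead of a single PD * EAD * LGD figure, forecast the expected monthly payments and discount them. The survival probabilities and amounts below are purely illustrative:

```python
# Discount the expected payment stream: each scheduled payment is weighted by
# the probability that the client is still paying in that month.
# All numbers are illustrative.

monthly_payment = 10_000          # scheduled annuity payment
monthly_rate = 0.12 / 12          # discount rate, 12% annualised
survival = [0.99, 0.98, 0.97, 0.95, 0.93, 0.90]  # P(still paying in month t)

npv = sum(
    p_alive * monthly_payment / (1 + monthly_rate) ** (t + 1)
    for t, p_alive in enumerate(survival)
)
print(f"Expected discounted inflow over 6 months: {npv:,.0f}")
```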
Where advanced data-driven alternatives stumble
It should be understood that the entire industry practice took shape in the years when there was no Big Data or machine learning, and all calculations came down to building scorecards: all the significant factors affecting the borrower's creditworthiness were taken and scored as points, the points were summed up, and the total determined whether or not to issue the loan.
As the history of issued loans accumulated and computing technology developed, the Bank's decision-making procedures gradually became more sophisticated. Scorecards evolved into logistic regression models built with Python scripts. The Bank began to segment its customers and products in order to build narrower, more specialized models within each segment. At the same time, with the growth of data storage capacity, it became possible to collect and store more and more information in an interconnected form.
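The link between the two is not mysterious: a scorecard is essentially rescaled log-odds. Below is a sketch of the common "points to double the odds" convention; the anchor values (600 points at 50:1 good/bad odds, 20 points to double the odds) are just a widespread convention, and the applicant's PD is invented:

```python
# Converting a logistic regression's log-odds into scorecard points using the
# "points to double the odds" (PDO) convention. All constants and the
# applicant's PD are illustrative.
import math

pdo = 20                          # points needed to double the good/bad odds
base_score, base_odds = 600, 50   # anchor: 600 points at odds of 50:1
factor = pdo / math.log(2)
offset = base_score - factor * math.log(base_odds)

def score(log_odds_good: float) -> float:
    """Convert the log-odds of being a good payer into scorecard points."""
    return offset + factor * log_odds_good

# Suppose the model predicts a PD of 2% for an applicant.
pd_applicant = 0.02
log_odds_good = math.log((1 - pd_applicant) / pd_applicant)
print(f"Applicant score: {score(log_odds_good):.0f}")   # roughly 599 points
```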
Ultimately, everything is moving towards the point where, for each customer who walks in, the best offer (the optimal banking product) is found almost instantly: the one that maximizes CLTV (customer lifetime value) over a given time horizon, or some other metric, depending on the current state of the Bank and the goals of its stakeholders.
Why not use a powerful neural network (that is, the notorious "artificial intelligence") to solve the above problem? I will list several circumstances that prevent this:
- The Central Bank requires that the models involved in calculating capital adequacy be used in the "live" credit process. That is, these same models must be applied when making decisions on granting loans, must be interpretable, and must pass a number of mandatory validation tests;
- customer databases are constantly expanding and being supplemented. For example, relatively new types of data are biometrics, web analytics, mobile app analytics, and social media scoring. New attributes are added over time, so we have almost no historical data for them;
- the Bank's products and processes are constantly changing, which requires recalculating CLTV for clients and NPV (net present value) for new products. But to build a model of acceptable quality, you need to wait several years, accumulate historical data, and compute the actual CLTV or NPV on a sample of real borrowers.
Outcome
With the best will in the world, building forecasting models in a bank cannot be regarded as a purely mathematical problem. In practice, we solve business problems that are, among other things, tightly intertwined with the requirements of the regulator in the person of the Central Bank.
Sometimes it seems that companies with strong data science could break into banking and change the rules of the game. But in order to issue loans, such a company has to play by the existing rules, and so it itself becomes a Bank, with all the ensuing consequences. The circle is complete.
The emergence of a cool new fintech startup in lending tends to be more about finding loopholes in the legal framework than about innovation in machine learning.