Balancing an ML training array with an insufficient number of minority-class objects

When modeling a process with machine learning (ML), one of the most laborious and demanding tasks is assembling a data array large enough to build a model with good quality characteristics. What if there is not enough data?





As part of a task to create a mathematical model that assesses the likelihood of manipulation in the financial statements a client provides to the bank, we ran into a shortage of data for supervised training. The quarterly financial statement (FS) was chosen as the object of the array. The array consisted of several thousand objects, which was enough for our task. The problem appeared when forming the values of the target variable: analysts had identified only 20 proven cases of FS manipulation. This is an extremely small number for an array of several thousand objects. If the array is split randomly, in our case into 5 folds for cross-validation, there is a high probability that some fold will contain no objects with proven cases of FS manipulation. In that case the cross-validation functionality becomes useless, and model training ends with an error.
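The fold problem described above can be avoided with stratified splitting, which preserves the class ratio in every fold. A minimal sketch with scikit-learn on synthetic data (the array sizes here are illustrative, not the bank's real data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical array: 5000 objects, only 20 in the minority class
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = np.zeros(5000, dtype=int)
y[:20] = 1

# StratifiedKFold preserves the class ratio in every fold, so each of
# the 5 test folds is guaranteed to contain minority objects
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(fold_counts)  # 4 minority objects in every fold
```

With 20 minority objects and 5 folds, each test fold receives exactly 4 of them, so cross-validation never sees an all-majority fold.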





At first glance there is a solution to this problem: the "oversampling" method, whose essence is to duplicate in the array those objects for which, in our case, FS manipulation has been proven. As it turned out, oversampling solved the cross-validation problem, but it did not produce a model with acceptable quality metrics. We concluded that duplication-based oversampling is not advisable when the minority and majority classes differ in size by several orders of magnitude. In our case, duplication creates a large number of objects in the array that are exact copies of their parent. The array loses its uniqueness, and training on such a sample leads to overfitting of the model. This is demonstrated by the model's quality metrics on the test sample.





Graphs of the ROC_AUC metric on the test and training samples as a function of the number of objects in the sample:





The maximum ROC_AUC value obtained on the test sample approaches 0.55, which is unsatisfactory for our case. Moreover, as the number of objects in the sample grows, the ROC_AUC value deteriorates, which indicates that the model is not fit for operation.





As a result of the work done, it was decided to create a method that finds objects in the array similar, by their parameters, to the minority objects and transfers those objects to the minority class. The description of the method is given in a truncated form, since the method is implemented for the specific array used to build the model.





The idea of the method is to compare each object of the array with the proven minority objects parameter by parameter. Eight parameters of the financial statements were selected for the comparison. For each client, the median of each parameter is calculated; in T-SQL this is done with the window function:





PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "PARAMETER") OVER (PARTITION BY "CLIENT")
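The same per-client median can be sketched in Python with pandas (the column names mirror the T-SQL snippet; the data values are illustrative):

```python
import pandas as pd

# Illustrative data: quarterly values of one FS parameter per client
df = pd.DataFrame({
    "CLIENT": ["A", "A", "A", "B", "B"],
    "PARAMETER": [10.0, 20.0, 40.0, 5.0, 15.0],
})

# groupby().transform("median") attaches the per-client median to every
# row, mirroring PERCENTILE_CONT(0.5) ... OVER (PARTITION BY "CLIENT")
df["PARAMETER_Me"] = df.groupby("CLIENT")["PARAMETER"].transform("median")
print(df["PARAMETER_Me"].tolist())  # [20.0, 20.0, 20.0, 10.0, 10.0]
```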
      
      



Thus, for each client, 8 medians are obtained: one for each of the 8 selected parameters.





Next, for each object, the relative deviation of each parameter from its client median is calculated (in percent):





Deviation_candidate = ((Parameter_candidate – Parameter_Me) / Parameter_Me) * 100;





Deviation_minority = ((Parameter_minority – Parameter_Me) / Parameter_Me) * 100;





Difference = |Deviation_candidate – Deviation_minority|;
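The three formulas above can be sketched directly in Python (the parameter values and medians here are hypothetical, chosen only to illustrate the arithmetic):

```python
def pct_deviation(value, median):
    """Relative deviation of a parameter value from the client's median, in percent."""
    return (value - median) / median * 100

# Hypothetical values of one parameter for two objects sharing a median of 100
dev_candidate = pct_deviation(120.0, 100.0)  # 20.0 %
dev_minority = pct_deviation(130.0, 100.0)   # 30.0 %

# Closeness of the candidate to the proven minority object on this parameter
difference = abs(dev_candidate - dev_minority)
print(difference)  # 10.0
```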





If an object's deviations were close to those of a proven minority object on at least 90% of the parameters, the object was transferred to the minority class. As a result, the minority class grew from 20 to 330 objects, and the model was retrained on the rebalanced array.
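The selection rule can be sketched as follows. The 90% share comes from the text above; the closeness threshold on the deviation differences is a hypothetical value, not the one used in the original model:

```python
import numpy as np

def similar_mask(dev, dev_minority, threshold=25.0, share=0.9):
    """Mark objects whose deviations are close to a proven minority object's
    deviations on at least `share` of the parameters.
    `threshold` (max allowed |difference| in percentage points) is a
    hypothetical value, not the one from the original model."""
    diff = np.abs(dev - dev_minority)      # parameter-wise differences
    close = diff <= threshold              # parameters counted as "close"
    return close.mean(axis=1) >= share     # close on >= 90% of parameters

# Two illustrative objects, 8 parameters each, compared to one minority object
dev = np.array([[5.0] * 8, [100.0] * 8])
mask = similar_mask(dev, np.zeros(8))
print(mask.tolist())  # [True, False]
```

Objects flagged by the mask are the candidates transferred to the minority class.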





Graphs of the ROC_AUC metric after rebalancing the array:





The ROC_AUC metric on the test sample reached 0.84, which is a satisfactory result for our case. The ROC_AUC values no longer deteriorate as the sample grows, so the model is suitable for operation.





To achieve a certain balance of the minority and majority classes in the sample, you can use the SMOTE or ADASYN algorithms from the imblearn (imbalanced-learn) library.





Both algorithms search for "nearest neighbors". Such a method is advisable when there is high confidence that all objects in the minority class truly belong to it by their parameters. In our case, objects entered the minority class based on analysts' judgments, and during balancing with the developed algorithm we found objects whose parameters made them the clearest candidates for assignment to the minority class.







