Forming a training sample under distribution shift

Disclaimer: this article is a translation of a post by Max Halford. The translation is not literal but adaptive, so that the ideas stay clear at every step.







“My friends and I recently qualified for the Data Science Game 2017 final. The first part of the competition was a Kaggle contest with a dataset provided by Deezer (1). The task was a binary classification problem: we had to predict whether a user was going to listen to the track suggested to them.



Like the other teams, we extracted the relevant features and trained an XGBoost (2) classifier. However, we did one extra thing: we took a subsample of the training dataset so that it became more representative of the test set."




One of the basic requirements for successfully training a machine learning model is that the training and test datasets come from the same distribution. A crude example: the model is trained on 20-year-old users, while the test sample contains users aged 60+.



It is intuitively clear that the model will not cope with ages it was never trained on. Of course, this example is purely synthetic, but in reality it is enough to train the model on users aged 20+ and try to apply it to users aged 30+ to see a significant difference. The result will be similar.



This is because models learn the distributions (3) of the data. If the distribution of a feature is the same in the training and test sets, the model will thank you.



Translator insert: when I sat down to translate, I had a question: why should the training set be tailored to the test set? After all, the test set represents the unseen data that will be fed to the model in production. Then I slept on it, reread the article, and it clicked. The point is that, under the influence of external factors, historical data may become irrelevant to the present. More on this below (the example is slightly re-adapted).



A shift in the distribution of a single feature can occur for various reasons. The most intuitive example can be borrowed from Facebook.



Let's say the company trained a model that used time spent in the application, in minutes, as a feature. Let it, purely synthetically, predict the level of user loyalty on a ten-point scale.



When the single Facebook application was split into the main social network (the feed, etc.) and a separate messenger, the time spent in the main application decreased; that is, the incoming data changed and no longer matched the historical data.

Mathematically, taking this time feature into account, the model will predict a lower level of loyalty, although in reality this is not the case: users' time was simply split between two applications. A sad outcome.



Thus, a distribution shift occurs when the distribution of historical data becomes irrelevant for predicting new data.
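
A quick aside from the translator (not in the original article): before resampling anything, it is worth confirming that a feature really is distributed differently in the training and test sets. One simple check is a two-sample Kolmogorov-Smirnov test; the sketch below assumes scipy is available and uses synthetic data in place of the real feature.

import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for the same feature in the training and test sets
train_feature = np.random.exponential(2, size=100000)
test_feature = np.random.exponential(1, size=10000)

statistic, p_value = ks_2samp(train_feature, test_feature)
# A large statistic and a tiny p-value mean the two samples are very unlikely
# to come from the same distribution, i.e. the feature has shifted
print(statistic, p_value)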



In the Deezer dataset, the distribution mismatch was in the feature measuring the number of songs the user had listened to before the moment being predicted. This feature had an exponential (4) distribution in both the training and test sets. However, in the test set it was more pronounced, so the mean in the training set was lower than in the test set. After resampling the training set, we managed to improve the ROC-AUC metric (5) and climb roughly 20 positions in the ranking.



Below is an example of the distribution difference:



import numpy as np
import plotly.figure_factory as ff

# Synthetic illustration: the training feature is "stretched" relative to the test one
train = np.random.exponential(2, size=100000)
test = np.random.exponential(1, size=10000)

distplot = ff.create_distplot([train, test], ['Train', 'Test'], bin_size=0.5)
distplot.update_layout(title_text='Distributions: Test, Train')
distplot.show()


[Figure: density plot of the Train and Test distributions]



The idea behind correcting a distribution shift is to resample the training set so that it reflects the test distribution.



Let's imagine that we want to create a subset of 50,000 observations from our training set that fits the distribution of the test set. What would you intuitively want to do?



Make the objects that are common in the test dataset common in the training set as well! But how do you determine which objects are needed more often and which less often?



Weights!



The steps will be something like this:



  • divide the number line of the distribution into equal intervals (bins); in the code below, the bin edges are taken as percentiles of the test set
  • count the number of objects in each bin (the bin size)
  • for each observation, compute a weight equal to 1 / (size of its bin)
  • draw a weighted subsample of k objects (objects with higher weights will appear in the subsample more often)


Translating this into code, we perform the following steps:



SAMPLE_SIZE = 50000
N_BINS = 300

# Split the distribution into intervals of equal probability mass,
# taking the percentiles of the test set as the bin edges
step = 100 / N_BINS

test_percentiles = [
    np.percentile(test, q, axis=0)
    for q in np.arange(start=step, stop=100, step=step)
]

# Assign each training observation to a bin:
# digitize returns the index of the bin each value falls into
train_bins = np.digitize(train, test_percentiles)

# Count how many training observations fall into each bin:
# index 0 holds the count for the first bin, index 1 for the second, and so on
train_bin_counts = np.bincount(train_bins)

# The weight of each observation is the inverse of the size of its bin
weights = 1 / np.array([train_bin_counts[x] for x in train_bins])

# Normalize the weights so that they sum to one
weights_norm = weights / np.sum(weights)

np.random.seed(0)

# Draw the new training sample without replacement,
# with probabilities proportional to the weights
sample = np.random.choice(train, size=SAMPLE_SIZE, p=weights_norm, replace=False)

distplot_with_sample = ff.create_distplot([train, test, sample], ['Train', 'Test', 'New train'], bin_size=0.5)
distplot_with_sample.update_layout(title_text='Distributions: Test, Train, New train')
distplot_with_sample.show()


[Figure: density plot of the Train, Test and New train distributions]



The new distribution (green) now matches the test distribution (orange) much better. We used a similar approach in the competition: the original dataset contained 3 million rows, from which we generated a new sample of 1.3 million objects. There was less data, but the more representative distribution improved the quality of training.
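
A small numeric sanity check (my addition, not in the original): after resampling, the mean of the new training sample and its Kolmogorov-Smirnov distance to the test set should move much closer to the test values. This assumes scipy is installed and reuses the train, test and sample arrays from the code above.

from scipy.stats import ks_2samp

print('train mean: ', train.mean())    # ~2.0 in this synthetic example
print('test mean:  ', test.mean())     # ~1.0
print('sample mean:', sample.mean())   # should be close to the test mean

print('KS train vs test: ', ks_2samp(train, test).statistic)
print('KS sample vs test:', ks_2samp(sample, test).statistic)  # noticeably smaller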



A few notes from the author's personal experience:



  • The number of bins does not matter much, but the fewer the bins, the faster the procedure runs (try changing the number of bins (N_BINS) to 3 or 30 in the example above and you will see that the difference is really small; a small helper for such experiments is sketched below)
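
To make such experiments easier, the steps above can be wrapped into a small helper function. This is my own addition (the function name and signature are not from the original article); it simply re-packages the code shown earlier with the number of bins as a parameter.

import numpy as np

def resample_to_match(train, test, sample_size, n_bins=300, seed=0):
    """Subsample `train` so that its distribution resembles `test`."""
    step = 100 / n_bins
    # Bin edges are the percentiles of the test distribution
    percentiles = [np.percentile(test, q) for q in np.arange(step, 100, step)]
    bins = np.digitize(train, percentiles)
    # Weight each observation by the inverse size of its bin, then normalize
    weights = 1 / np.bincount(bins)[bins]
    weights = weights / weights.sum()
    rng = np.random.RandomState(seed)
    return rng.choice(train, size=sample_size, p=weights, replace=False)

# Compare different bin counts -- the resulting samples look very similar
sample_3 = resample_to_match(train, test, SAMPLE_SIZE, n_bins=3)
sample_30 = resample_to_match(train, test, SAMPLE_SIZE, n_bins=30)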



The resampling algorithm is available on the author's GitHub (the xam folder). In the future, the author plans to look into new topics and share them on the blog.



I hope the translation and notes were helpful and clear. I look forward to your constructive feedback. Thank you for your time.



Footnotes:



1. Deezer is a French online music streaming service, like Spotify or Yandex Music: you get the idea.



2. XGBoost is an extreme gradient boosting algorithm. I absolutely love calling it "gradient boosting on steroids". The idea of boosting is to train several homogeneous weak learners, each of which builds on the experience of the previous one, paying more attention to the objects on which the previous learner made the biggest mistakes. The idea of the gradient is, put simply, to minimize the training error step by step. XGBoost, as an implementation, is a more computationally efficient configuration of gradient boosting.
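
For illustration only (my addition, not part of the original article): a minimal sketch of training an XGBoost binary classifier via its scikit-learn-style Python API on synthetic data. It assumes the xgboost package is installed; the features and labels are placeholders.

import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.exponential(2, size=(1000, 5))                   # placeholder features
y = (X[:, 0] + rng.normal(size=1000) > 2).astype(int)    # placeholder binary labels

model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X, y)
probas = model.predict_proba(X)[:, 1]  # predicted probability of class 1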



3. Distribution here means exactly that: the law describing how the values of a variable are scattered.



4. In my personal opinion, the easiest way to picture the exponential distribution in your head is to think of it as the distribution of waiting times for events that occur with constant intensity.
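
To make this concrete (my addition): the exponential distribution with rate λ has density

f(x) = λ · exp(−λx),  x ≥ 0,

and its intensity (hazard rate) is constant and equal to λ: the chance of the event happening in the next instant does not depend on how long you have already waited.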



5. ROC-AUC (Area Under the Receiver Operating Characteristic curve) is the area under the "receiver operating characteristic" curve: a literal name, since the metric came from signal processing theory. The ROC curve itself is quite neat: it shows the ratio of the model's true positives to false positives as the probability threshold for assigning an object to a class changes, forming an "arc". Since the trade-off between TP and FP is visible, an optimal probability threshold can be chosen depending on the cost of Type I and Type II errors.



To evaluate the model without fixing a particular probability threshold, the ROC-AUC metric is used; it takes values in the range [0, 1]. For a constant model on balanced classes, ROC-AUC is approximately 0.5, so models scoring below that do not pass the sanity check. The closer the area under the ROC curve is to one, the better, but to judge how useful the results actually are, it is worth comparing the trained model's ROC-AUC with that of a constant model.
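
To make the metric concrete (my addition, not in the original): a minimal sketch of computing ROC-AUC with scikit-learn on toy labels and scores; the numbers are made up for illustration.

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                    # ground-truth classes
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]     # predicted probabilities of class 1

print(roc_auc_score(y_true, y_scores))          # ~0.89 for this toy example

# A constant model that gives every object the same score has ROC-AUC = 0.5,
# the sanity-check baseline mentioned above
print(roc_auc_score(y_true, [0.5] * len(y_true)))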


