CatBoost and ML Contests

Data Analysis and Base Model





Introduction

This article is based on data from a competition that Driven Data published to address water source problems in Tanzania. 





The information for the competition was obtained by the Tanzanian Ministry of Water Resources using an open source platform called Taarifa. Tanzania is the largest country in East Africa with a population of about 60 million. Half of the population does not have access to clean water, and 2/3 of the population suffers from poor sanitation. In poor homes, families often have to spend hours on foot to get water from water pumps. 





Billions of dollars in foreign aid are being provided to tackle Tanzania's freshwater problem. However, the Tanzanian government has not been able to resolve this problem to this day. A significant part of the water pumps are completely out of order or practically do not work, and the rest require major repairs. Tanzania's Ministry of Water Resources agreed with Taarifa, and they launched a competition in the hopes of getting community clues on how to accomplish their tasks.





Data

There are many characteristics (features) associated with water pumps in the data, there is information related to the geographical locations of points with water, the organizations that built and manage them, as well as some data about the regions, local government territories. There is also information about the types and number of payments. 





Water supply points are divided into serviceable , non-functional and serviceable, but in need of repair . The goal of the competition is to build a model predicting the functionality of water supply points.

The data contains 59,400 rows and 40 columns. The target label is contained in a separate file.

The metric used for this competition is the classification rate , which calculates the percentage of rows where the predicted class matches the actual class in the test set. The maximum value is 1 and the minimum is 0. The goal is to maximize the classification rate .





Data analysis

Descriptions of fields in the data table:





  • amount_tsh - total static head (amount of water available to the water supply point)





  • date_recorded — 





  • funder — 





  • gps_height — 





  • installer — 





  • longitude — GPS ()





  • latitude — GPS ()





  • wpt_name — ,





  • num_private — 





  • basin — 





  • subvillage — 





  • region — 





  • region_code —  ()





  • district_code — () 





  • lga — 





  • ward —





  • population — 





  • public_meeting — /





  • recorded_by —





  • scheme_management — 





  • scheme_name — 





  • permit — 





  • construction_year — 





  • extraction_type — 





  • extraction_type_group — 





  • extraction_type_class — 





  • management — 





  • management_group — 





  • payment — 





  • payment_type — 





  • water_quality — 





  • quality_group — 





  • quantity — 





  • quantity_group — 





  • source — 





  • source_type — 





  • source_class — 





  • waterpoint_type — 





  • waterpoint_type_group — 





    ,  —  :





    , . :





    • (under-sampling)





    • , (over-sampling)





    •  —  (SMOTE)





    • ,





    .





    , .





    .





    , scheme_name, , .





    / . permit, installer funder.





    .





    , . , (quantity_group).





    , , . . , . , , , .





    ? , quality_group.





    , , . .





    quality_group .





     —  (waterpoint_type_group).





    , other . ? , .





     —  , , , 80- . 





    . , . , 500 .





    Danida —  , , . RWSSP ( ), Dhv . , , , . , , , . , , .





    , , , . , .





     —  . .





    , . , .





     —  .





    , , , , - .





    0 . , amount_tsh (label = 0). amount_tsh. , 500 .





, .





  • installer , . . .





  • , 71 (0,95 ), «other».





  • funder.  — 98.





  • . . , . : scheme_management, quantity_group, water_quality, payment_type, extraction_type, waterpoint_type_group, region_code.





  • latitude longitude region_code.





  • subvillage scheme_name.





  • public_meeting permit .





  • subvillage, public_meeting, scheme_name, permit, , . , , .





  • scheme_management, quantity_group, water_quality, region_code, payment_type, extraction_type, waterpoint_type_group, date_recorded, recorded_by , , .





. , , CatBoost. .





, . .





def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        max_ctr_complexity=5,
        task_type='CPU',
        iterations=10000,
        eval_metric='AUC',
        od_type='Iter',
        od_wait=500,
        **kwargs
    )return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=1000,
        plot=False,
        use_best_model=True)
      
      



AUC, , .





.  — 





def classification_rate(y, y_pred):
    return np.sum(y==y_pred)/len(y)
      
      



,  —  . OOF (Out-of-Fold). ; . , .





def get_oof(n_folds, x_train, y, x_test, cat_features, seeds):    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]  
        
    oof_train = np.zeros((len(seeds), ntrain, 3))
    oof_test = np.zeros((ntest, 3))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 3))    test_pool = Pool(data=x_test, cat_features=cat_features) 
    models = {}    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)          
        for i, (train_index, test_index) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train.iloc[train_index, :]
            y_tr = y[train_index]
            x_te = x_train.iloc[test_index, :]
            y_te = y[test_index]
            train_pool = Pool(data=x_tr, label=y_tr, cat_features=cat_features)
            valid_pool = Pool(data=x_te, label=y_te, cat_features=cat_features)model = fit_model(
                train_pool, valid_pool,
                loss_function='MultiClass',
                random_seed=seed
            )
            oof_train[iseed, test_index, :] = model.predict_proba(x_te)
            oof_test_skf[iseed, i, :, :] = model.predict_proba(x_test)
            models[(seed, i)] = modeloof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models
      
      



,  —  seeds.





Learning curve of one of the folds
 

, .





, ().





:





balanced accuracy: 0.6703822994494413
classification rate: 0.8198316498316498
      
      



.





, -5 0,005 , , .





,  —  . , , .





balanced accuracy: 0.6549535670689709
classification rate: 0.8108249158249158
      
      



.





:





  • ;





  • ;





  • CatBoost, ;





  • OOF-;





  • .





The right approach to preparing data and choosing the right tools for creating a model can give excellent results even without additional feature generation.





As a homework assignment, I suggest adding new features, choosing the optimal model parameters, using other libraries to boost the gradient, and building ensembles from the resulting models.





The code from the article can be viewed here .








All Articles