Data science is not only fit-predict
Imagine that you start working for a company that performs repetitive operations on endless tables, say a large retailer or a leading telecom operator. Every day you are asked to find out whether a customer will stay with you, or whether there will be enough goods on the shelves by the end of the week. The algorithm looks simple. You take a sample, study endless rows of features, remove the garbage, generate new features and collect a pivot table. You feed the prepared data to the model, tune the parameters and look forward to the coveted figures of the final metric. This repeats itself day after day. Spending only 60 minutes a day on generating features or selecting parameters, you lose at least 20 hours a month. That is almost a whole working day that could have gone into a new task, training a neural network, or reading a few papers on arXiv.
A large part of this routine can be automated, and that is exactly what this article is about.
TL;DR
This article walks through three libraries that take over different parts of the routine: featuretools (feature generation), optuna (hyperparameter search) and H2O AutoML (automated model training).
The examples assume basic familiarity with pandas and sklearn. The article does not pretend to be a complete guide; it is a short overview of tools that can save a working DS specialist a noticeable amount of time.
All experiments use data from Kaggle, namely the Tabular Playground Series (January 2021) competition: a synthetic tabular regression task scored with RMSE. A plain gradient-boosting model serves as the baseline, and optuna is later used to tune its hyperparameters.
Let's start with feature generation. What does it usually look like in practice? You take the raw columns and combine them by hand: a groupby() here, a sum() there, a ratio or a product of two numeric features on top. None of these steps is hard, but together they eat up hours, and it is easy to lose track of what has already been tried.
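To make it concrete, here is a rough sketch of that manual routine in pandas; the table and column names are made up purely for illustration.

import pandas as pd

# a toy transactions table, invented for the example
df = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'amount':      [100, 250, 80, 40, 300],
    'n_items':     [2, 5, 1, 1, 6],
})

# the classic recipe: aggregate with groupby() and sum(), then join the result back
agg = (df.groupby('customer_id')[['amount', 'n_items']]
         .sum()
         .add_suffix('_total')
         .reset_index())
df = df.merge(agg, on='customer_id', how='left')

# plus a few hand-written interactions
df['amount_per_item'] = df['amount'] / df['n_items']

Every new idea means another block like this, which is exactly the kind of work worth automating.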
Meet featuretools, a library for automated feature engineering. Let's see what it can do on our data.
Two concepts are central here: feature primitives (fp) and deep feature synthesis (dfs). Primitives are elementary operations on columns: aggregations such as sum or mean, and transformations such as products or differences. Deep feature synthesis stacks primitives on top of one another, and the number of stacked layers is called the depth ("depth") of the synthesis. The greater the depth, the more complex, and the more numerous, the generated features.
First, install and import the library.
pip install featuretools
import featuretools as ft
The featuretools API is well documented. To see the full list of available primitives, run:
ft.primitives.list_primitives()
The call returns a table with the name, type and a short description of every primitive.
Now let's prepare the data. featuretools works with an EntitySet, a collection of tables (entities) plus the relationships between them; both the set and every entity get their own id. In our case there is only one table, the January training data, so the setup is minimal: create an empty EntitySet, register the dataframe in it and run the synthesis.
es = ft.EntitySet(id='data')                        # an empty EntitySet
es.entity_from_dataframe(entity_id='january',       # register the training table as an entity
                         dataframe=df_train.drop('target', axis=1),
                         index='id')                # column that uniquely identifies a row (the Kaggle 'id' column)
feature_matrix, feature_defs = ft.dfs(entityset=es,            # run deep feature synthesis
                                      target_entity='january',
                                      trans_primitives=['add_numeric', 'multiply_numeric'],
                                      verbose=1)
The result does not break the familiar pandas workflow: feature_matrix is an ordinary DataFrame that now contains the original columns plus all pairwise sums and products of the numeric features, ready to be fed into the model. Where did the "magic" come from? Each generated column is simply one of the chosen primitives applied to a combination of inputs, and feature_defs keeps the recipe for every one of them.
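A useful detail: because feature_defs stores recipes rather than values, the same definitions can be reapplied to the test data without rerunning the synthesis. A minimal sketch, assuming df_test has the same columns and the same 'id' index column as the training table:

es_test = ft.EntitySet(id='data_test')
es_test.entity_from_dataframe(entity_id='january',   # the entity name must match the one used for training
                              dataframe=df_test,
                              index='id')
# recompute the very same features for the test set
feature_matrix_test = ft.calculate_feature_matrix(features=feature_defs,
                                                  entityset=es_test,
                                                  verbose=1)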
Feature generation is only half of the routine. Once the features are ready, the metric starts to depend heavily on the model's hyperparameters, and picking them by hand is the second big time sink. Let's automate that too.
sklearn ships two built-in tools for this: GridSearchCV and RandomizedSearchCV. GridSearchCV exhaustively tries every combination on a fixed grid, which becomes prohibitively slow as soon as the grid grows. RandomizedSearchCV samples combinations at random, which is faster but gives no guarantee of landing in a good region. Neither of them uses the results of previous trials to decide where to look next.
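For reference, a typical RandomizedSearchCV setup looks roughly like the sketch below; the estimator and the parameter ranges are my own illustrative choices, not the ones used in this article.

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(100, 500),      # integers sampled uniformly from a range
    'max_depth': [None, 10, 20],            # values drawn from a fixed list
    'min_samples_leaf': randint(1, 20),
}

search = RandomizedSearchCV(RandomForestRegressor(random_state=47),
                            param_distributions=param_distributions,
                            n_iter=20,                              # how many random combinations to try
                            scoring='neg_root_mean_squared_error',
                            cv=3,
                            random_state=47)
# search.fit(X_train, y_train)   # X_train / y_train come from your own split
# search.best_params_            # the best combination found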
Smarter optimizers take a different approach: they build a model of the objective from the trials already finished and spend the remaining budget where an improvement is most likely. The best-known "Bayesian" libraries are optuna, hyperopt and scikit-optimize.
Here I will use optuna.
It works with sklearn and the gradient-boosting libraries (xgboost, lightgbm, catboost), as well as with pytorch, tensorflow, chainer and mxnet. Unlike sklearn's built-in searchers, optuna learns from the trials it has already run and can also prune clearly unpromising trials early, which saves a lot of compute.
Under the hood optuna's default sampler is the Tree-structured Parzen Estimator (TPE); the paper describing the approach is available on arxiv.org if you want the details. Installation and imports are straightforward.
# install the library (if it is not installed yet)
pip install optuna
# imports: the optimizer itself plus two handy visualizations
import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances
Our starting point: the untuned baseline model scores RMSE = 0.73562. Let's see what optuna can squeeze out of it in just 5 trials. The whole search is described by a single objective function.
The central object is the trial: inside the objective you describe the search space by calling its suggest_* methods, one call per hyperparameter. For example:
'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0)
# a continuous parameter sampled on a logarithmic scale
'num_leaves': trial.suggest_int('num_leaves', 1, 300)
# an integer parameter sampled from a range
'max_depth': trial.suggest_categorical('max_depth', [-1, 10, 20])
# a parameter chosen from a fixed list of values
Then create a study and launch the optimization:
study = optuna.create_study(direction='minimize')  # we are minimizing RMSE
study.optimize(objective, n_trials=5)               # objective is the function described above, 5 is the number of trials
# the best combination found is stored in study.best_params
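For completeness, here is a minimal sketch of what the objective passed to study.optimize might look like. The LightGBM regressor and the simple hold-out split are my assumptions for illustration, not necessarily what the original experiment used; feature_matrix and df_train are reused from the featuretools section above.

import lightgbm as lgb
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# reuse the generated features and the target from the earlier snippets
X_tr, X_val, y_tr, y_val = train_test_split(feature_matrix, df_train['target'],
                                            test_size=0.2, random_state=47)

def objective(trial):
    params = {
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 1, 300),
        'max_depth': trial.suggest_categorical('max_depth', [-1, 10, 20]),
    }
    model = lgb.LGBMRegressor(**params, random_state=47)
    model.fit(X_tr, y_tr)
    preds = model.predict(X_val)
    # the study minimizes whatever the objective returns, in our case RMSE
    return np.sqrt(mean_squared_error(y_val, preds))

The visualization helpers imported earlier, plot_optimization_history(study) and plot_param_importances(study), then show how the search progressed and which parameters mattered most.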
On the Kaggle data optuna earned its keep almost immediately: five trials ran in a matter of minutes, far less than a comparable GridSearchCV sweep would take, and brought the score down to RMSE = 0.7221, a noticeable improvement over the 0.73562 baseline.
Automated machine learning in 7 lines of code
pip install -f https://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
import h2o
from h2o.automl import H2OAutoML
# start a local H2O cluster, giving it up to 16 GB of memory
h2o.init(max_mem_size='16G')
Note that pandas is not needed here at all: H2O reads the data into its own H2OFrame, which is the format AutoML works with.
train = h2o.import_file('../input/tabular-playground-series-jan-21/train.csv')
test = h2o.import_file('../input/tabular-playground-series-jan-21/test.csv')
Specify the feature columns and the target.
x = test.columns[1:]   # all feature columns (everything except the id)
y = 'target'
Now configure and launch AutoML itself. For reproducibility the seed is fixed (random_state = 47), and the run time is capped at 3100 seconds, just under an hour (3600 seconds).
SEED = 47                               # the seed mentioned above
aml = H2OAutoML(max_models=2,           # upper bound on the number of models to train
                seed=SEED,
                max_runtime_secs=3100)  # hard time limit for the whole run
aml.train(x=x, y=y, training_frame=train)
# within this budget AutoML picks the algorithms, trains and cross-validates the models on its own
That's it! When the run finishes, it remains to look at the leaderboard of trained models, predict on the test set and prepare the Kaggle submission. H2O returns predictions as an H2OFrame, so convert them back to pandas before writing the file.
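A quick way to inspect the result before predicting: aml.leaderboard is an ordinary H2OFrame with all trained models ranked by the cross-validated metric, and aml.leader is the best of them.

lb = aml.leaderboard
print(lb.head(rows=lb.nrows))   # show every trained model, not only the top ones
best_model = aml.leader         # the model that aml.predict() will use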
preds = aml.predict(test)                                   # predictions of the best model (the leader)
df_sub['target'] = preds.as_data_frame().values.flatten()   # back to pandas / numpy
df_sub.to_csv('h2o_submission_5.csv', index=False)
The AutoML submission scores 0.71487. Remember that the metric is the root mean squared error (RMSE), so lower is better, and this is the best result of the three approaches. For comparison, the leaders of the competition scored around RMSE = 0.69381, so a fully automated run lands surprisingly close to the top. Still, AutoML is not a silver bullet.
To sum up: none of these tools will think for you, but together they take a large part of the routine off your hands, and the saved hours can go into work that actually needs a human.
Parameter optimization and automated training improved the final RMSE metric by roughly 0.02 compared with the baseline model. For simple synthetic data this is a solid result, and it suggests that the same approach is applicable to more serious problems.
Of course, the frameworks covered here cannot fully replace a competent specialist, but they do help with the daily routine. I highly recommend reading the full documentation for each library. Use your time wisely and have fun.