Anyone who has worked with machine learning knows that even simple ML models can take unacceptably long to train on a large amount of data. Regression and classification tasks turn into minutes or even hours of training.
This article will demonstrate, using algorithms from the Scikit-Learn library as an example, how you can extend your training capabilities by switching to the accelerated computations of the daal4py library.
Introduction
Scikit-Learn provides a solid set of tools for solving machine learning problems: classification, regression, clustering, and more. We will work with several of these algorithms.
In 2019, Intel released the daal4py library, built on the Intel Data Analytics Acceleration Library (DAAL). It is aimed directly at predictive data analytics and stands out from its peers in performance and ease of use.
The daal4py technology increases the performance of classic sklearn methods through accelerated computations (in particular, optimized matrix operations) based on Intel DAAL.
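daal4py is designed as a drop-in replacement for the corresponding sklearn classes, so switching usually means changing only an import. A minimal sketch of the idea (with a fallback to plain sklearn when daal4py is not installed, so the snippet runs either way):

```python
from sklearn.datasets import make_classification

# daal4py mirrors the sklearn class layout; fall back to sklearn if it is absent
try:
    from daal4py.sklearn.ensemble import RandomForestClassifier
except ImportError:
    from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
print(clf.score(X, y))  # the estimator API is identical in both implementations
```

Because the interfaces match, the rest of a sklearn pipeline (GridSearchCV, metrics, and so on) needs no changes.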
Implementation
Let's look at daal4py.sklearn methods on a test problem.
The dataset is published on Kaggle: Cardiovascular Disease dataset.
The task is to build a model that predicts the presence or absence of cardiovascular disease in a person.
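For reference, a sketch of the data preparation step. The Kaggle file is commonly named cardio_train.csv and is semicolon-separated (treat the file name and column names as assumptions); here a small synthetic frame stands in for the real data so the snippet is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
# Synthetic stand-in for the Kaggle data; real usage would be, e.g.:
# df = pd.read_csv("cardio_train.csv", sep=";")
df = pd.DataFrame({
    "age": rng.integers(30, 65, n),
    "ap_hi": rng.integers(100, 180, n),   # systolic blood pressure
    "ap_lo": rng.integers(60, 110, n),    # diastolic blood pressure
    "cholesterol": rng.integers(1, 4, n),
    "cardio": rng.integers(0, 2, n),      # target: disease present or not
})

X = df.drop(columns="cardio")
y = df["cardio"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```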
Since this is a classification task, it was decided to use an ensemble of LogisticRegression, RandomForestClassifier, and KNeighborsClassifier models, each tuned with Scikit-Learn's GridSearchCV parameter-search tool.
First, let's train both implementations of the algorithms with the same parameters and compare them:
from timeit import default_timer as timer
from sklearn.model_selection import GridSearchCV

# Best Logistic Regression
def get_best_clf_lr(name, clf, params):
    start = timer()
    grid_clf = GridSearchCV(estimator=clf, param_grid=params, n_jobs=-1)
    grid_clf.fit(X_train, y_train)
    end = timer()
    learning_time = end - start
    print(learning_time)
    return name, learning_time, grid_clf.best_estimator_

# Best Random Forest Classifier
def get_best_clf_rf(name, clf, params):
    start = timer()
    grid_clf = GridSearchCV(estimator=clf, param_grid=params, n_jobs=-1, cv=5)
    grid_clf.fit(X_train, y_train)
    end = timer()
    learning_time = end - start
    print(learning_time)
    return name, learning_time, grid_clf.best_estimator_

# Best K Neighbors Classifier
def get_best_clf_knn(name, clf, params):
    start = timer()
    grid_clf = GridSearchCV(estimator=clf, param_grid=params, n_jobs=-1)
    grid_clf.fit(X_train, y_train)
    end = timer()
    learning_time = end - start
    print(learning_time)
    return name, learning_time, grid_clf.best_estimator_
We fit each model with the same parameter grid for both the sklearn and daal4py implementations. Here is the RandomForestClassifier example:
from sklearn.ensemble import RandomForestClassifier as RandomForestClassifier_skl
from daal4py.sklearn import ensemble

# Random Forest Classifier
params_RF = {
    'n_estimators': [1, 3, 5, 7, 10],
    'max_depth': [3, 5, 7, 9, 11, 13, 15],
    'min_samples_leaf': [2, 4, 6, 8],
    'min_samples_split': [2, 4, 6, 8, 10]
}

name, lrn_time, model = get_best_clf_rf("RF_sklearn", RandomForestClassifier_skl(random_state=42), params_RF)
learn_data_skl.append([name, model, lrn_time])

name, lrn_time, model = get_best_clf_rf("RF_daal4py", ensemble.RandomForestClassifier(random_state=42), params_RF)
learn_data_daal.append([name, model, lrn_time])
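With both lists filled in, the timings can be compared side by side. A sketch of that comparison with placeholder numbers (the names and times below are illustrative, not measured results):

```python
import pandas as pd

# Placeholder timings in seconds; real values come from the runs above
learn_data_skl = [["RF_sklearn", 120.0], ["KNN_sklearn", 2700.0]]
learn_data_daal = [["RF_daal4py", 80.0], ["KNN_daal4py", 1500.0]]

skl = pd.DataFrame(learn_data_skl, columns=["name", "time_s"])
daal = pd.DataFrame(learn_data_daal, columns=["name", "time_s"])

# Pairwise speedup of the daal4py run over the matching sklearn run
speedup = skl["time_s"].values / daal["time_s"].values
for n_skl, n_daal, s in zip(skl["name"], daal["name"], speedup):
    print(f"{n_skl} vs {n_daal}: {s:.2f}x faster")
```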
The same comparison was repeated for the other models; the most time-consuming was KNeighborsClassifier, whose grid search ran for about 45 minutes in the sklearn version. Overall, the daal4py implementations trained roughly 1.5 to 2 times faster than their sklearn counterparts.
For example, with RandomForestClassifier the saved time made it possible to increase the training sample by 34% without changing the overall execution time.
The best estimators were then combined into an ensemble.
The models were evaluated with the ROC AUC score.
After feature engineering, the ensemble reached a roc_auc_score of about 0.74.
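The article does not show the ensemble code itself; below is a minimal sketch of one reasonable reading, a soft-voting ensemble over the three model types evaluated with roc_auc_score. The synthetic data and untuned models are stand-ins, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Soft voting averages the predicted probabilities of the three models;
# in the article these would be the GridSearchCV best_estimator_ objects
ens = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",
)
ens.fit(X_train, y_train)
auc = roc_auc_score(y_test, ens.predict_proba(X_test)[:, 1])
print(f"roc_auc_score: {auc:.3f}")
```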
Thanks to its fast matrix computations, the daal4py library accelerates machine learning models and expands what can be trained in the same amount of time. On this test problem, it was possible to enumerate more hyperparameter values and to increase the training sample by 34% without changing the execution time.