Feature selection in machine learning

Hello, Habr!

We at Reksoft have translated the article Feature Selection in Machine Learning into Russian. We hope it will be useful to everyone interested in the topic.

In the real world, data is rarely as clean as business customers tend to assume. That is why data mining and data wrangling are in demand: they help identify missing values and patterns in the data that a human simply cannot spot. Machine learning then finds and uses these patterns to predict outcomes from the discovered relationships in the data.

To understand any algorithm, you need to look at all the variables in the data and figure out what they represent. This is critical, because the interpretation of the results rests on understanding the data. If the dataset has 5 or even 50 variables, you can examine each of them. But what if there are 200? Then there simply won't be enough time to study every single variable. Moreover, some algorithms do not handle categorical data, so every categorical column has to be converted into a quantitative form (it may look quantitative, but the metrics will show that it is categorical) before it can be added to the model. As a result, the number of variables grows, and now there are around 500 of them. What to do now? You might think that dimensionality reduction is the answer. Dimensionality reduction algorithms do reduce the number of parameters, but they hurt interpretability. What if there were other techniques that drop some of the features while keeping the remaining ones easy to understand and interpret?

Depending on whether the analysis is based on regression or classification, feature selection algorithms may differ, but the main idea of their implementation remains the same.

Strongly correlated variables

Variables that are highly correlated with each other give the model the same information, so there is no need to use all of them in the analysis. For example, if a dataset contains the attributes "Online Time" and "Traffic Used", we can expect them to be correlated to some degree, and we will see a strong correlation even on an unbiased sample of the data. In that case, only one of these variables is needed in the model: using both will overfit the model and bias it towards one particular feature.
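As a minimal sketch of how such pairs can be found (the column names online_time and traffic_used and the 0.9 threshold are made up for illustration), the pandas correlation matrix is enough:

```python
import numpy as np
import pandas as pd

# Toy dataset: "traffic_used" is almost a linear function of "online_time",
# so the two columns carry essentially the same information.
rng = np.random.default_rng(0)
online_time = rng.uniform(0, 10, 500)
df = pd.DataFrame({
    "online_time": online_time,
    "traffic_used": online_time * 1.5 + rng.normal(0, 0.1, 500),
    "age": rng.integers(18, 70, 500),
})

corr = df.corr().abs()
# Look only at the upper triangle so each pair is counted once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Highly correlated, candidates to drop:", to_drop)
df_reduced = df.drop(columns=to_drop)
```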

P-values

In regression analysis, every coefficient in the fitted model comes with a p-value. A small p-value means that the relationship between the feature and the target is statistically significant, i.e. unlikely to be a coincidence. If the p-value is large, the feature tells the model almost nothing about the target variable, and it becomes a candidate for removal.
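A minimal sketch of this idea with statsmodels (the toy columns and the 0.05 threshold are assumptions made for illustration, not part of the original article):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "useful": rng.normal(size=300),  # actually drives the target
    "noise": rng.normal(size=300),   # unrelated to the target
})
y = 3 * X["useful"] + rng.normal(scale=0.5, size=300)

# Fit OLS and look at the p-value of every coefficient.
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.pvalues)

# Keep only the features whose p-value is below the threshold.
pvals = model.pvalues.drop("const")
print("Significant features:", pvals[pvals < 0.05].index.tolist())
```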

Forward selection is an iterative procedure: the model starts with no features at all, and at each step every remaining feature is tried and the one that improves the model the most, typically the one with the smallest p-value, is added. The process stops when the next best candidate is no longer statistically significant, i.e. its p-value stays above the chosen threshold.
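scikit-learn ships a greedy implementation of the same idea, SequentialFeatureSelector, which scores candidate subsets with cross-validation instead of p-values; a rough sketch on synthetic data (all parameters here are arbitrary choices for the example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# 10 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3, random_state=0)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",  # start empty, add one feature per step
    cv=5,
)
sfs.fit(X, y)
print("Selected feature indices:", np.flatnonzero(sfs.get_support()))
```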

Backward elimination works in the opposite direction. The model starts with all features, and at every iteration the least significant one (the feature with the highest p-value) is removed. The model is then refit, and the procedure repeats until only features with p-values below the threshold remain.
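A hand-rolled sketch of p-value based backward elimination (the toy data and the 0.05 threshold are assumptions, not from the original article):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"x{i}" for i in range(5)])
y = 2 * X["x0"] - 3 * X["x1"] + rng.normal(scale=0.5, size=300)

features = list(X.columns)
while features:
    model = sm.OLS(y, sm.add_constant(X[features])).fit()
    pvals = model.pvalues.drop("const")
    worst = pvals.idxmax()       # least significant feature
    if pvals[worst] < 0.05:      # everything left is significant
        break
    features.remove(worst)       # drop it and refit

print("Remaining features:", features)
```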

RFE (Recursive Feature Elimination) automates this kind of search: you specify in advance how many features should remain. This matters when the dataset turns into a large number of columns (say 200-400) and it is impossible to justify every one of them by hand. RFE fits a model, ranks the features by how much they matter to that model, removes the weakest ones and repeats the cycle until the requested number of features is left. Keep in mind that RFE can be slow, because the model is retrained many times (the more features you start with and the fewer you drop per iteration, the longer it takes).
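In scikit-learn this is the RFE class; a minimal sketch on synthetic data (the estimator and the number of features to keep are arbitrary for the example):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 25 features, only 5 informative; ask RFE to keep 5.
X, y = make_classification(n_samples=500, n_features=25, n_informative=5, random_state=0)

rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    step=1,  # drop one feature per iteration
)
rfe.fit(X, y)
print("Kept features:", [i for i, keep in enumerate(rfe.support_) if keep])
print("Ranking (1 = kept):", rfe.ranking_)
```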

The methods above are usually applied together with linear models, where significance is judged by p-values (and redundancy by correlations). Tree-based ensembles such as Random Forest, LightGBM and XGBoost work differently: after training, they report a "feature importance" score for every variable. Features whose importance is close to zero contribute almost nothing to the predictions and can safely be excluded.
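A quick sketch with a Random Forest (synthetic data and arbitrary hyperparameters, purely for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Sort features by their importance; near-zero ones are candidates for removal.
order = np.argsort(forest.feature_importances_)[::-1]
for i in order:
    print(f"feature {i}: importance {forest.feature_importances_[i]:.3f}")
```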

Another group of methods relies on the trade-off between bias and variance. A model with too many parameters fits the training data too closely (overfits): its bias is low, but its variance is high, and it generalizes poorly to new data. Regularization fights this by adding a penalty on the magnitude of the coefficients to the loss function, trading a little bias for a large reduction in variance. There are two classic penalties:

L1 (Lasso): penalizes the sum of the absolute values of the coefficients. Under this penalty the coefficients of uninformative features are shrunk all the way to zero (i.e. such features are removed from the model entirely), so Lasso performs feature selection on its own.

L2 (Ridge): penalizes the sum of the squared coefficients. Ridge shrinks the coefficients towards zero but never exactly to zero, so all features stay in the model.

If you need the stability of Ridge but still want some features to be dropped, use Elastic-Net, which combines both penalties (a small comparison of the three is sketched below).
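A minimal sketch comparing how the three penalties treat uninformative coefficients (the alpha values and the synthetic data are arbitrary illustration choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=300, n_features=15, n_informative=4, noise=5.0, random_state=0)

for name, model in [
    ("Lasso (L1)", Lasso(alpha=1.0)),
    ("Ridge (L2)", Ridge(alpha=1.0)),
    ("Elastic-Net", ElasticNet(alpha=1.0, l1_ratio=0.5)),
]:
    model.fit(X, y)
    zeroed = np.sum(np.isclose(model.coef_, 0.0))
    print(f"{name}: {zeroed} of {len(model.coef_)} coefficients are exactly zero")
```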

These are the main ways to select features: which one to use depends on the task, but the goal is always the same, to keep only the variables that genuinely help the model while staying easy to understand and interpret.

That's all! Thank you for your attention!



