Hello! I am a web developer and have been interested in machine learning for several years. In my day-to-day work I mostly solve problems that have nothing to do with machine learning and that interest me less, so from time to time I forget what I once read or used. To make a cheat sheet for myself, consolidate my knowledge, and share it with others, I decided to write this series of articles on machine learning. I'll start with data preprocessing.
In this article I will talk about the problems that occur in data and how to solve them, as well as the most commonly used methods of preparing data before feeding it to different models.
Missing values
Consider the following dataset. I made it up myself and will refer to it throughout the article.
ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete's weight | Medal |
---|---|---|---|---|---|---|
1 | Ivan | Rowing | Russian Federation | 1985 | 265 | B |
2 | | Boxing | Great Britain | 1986 | 54 | S |
3 | Kim | Greco-Roman wrestling | North Korea | 1986 | 93 | G |
4 | Oleg | Greco-Roman wrestling | | 1984 | | B |
5 | Pedro | Rowing | Brazil | | 97 | N |
6 | Valery | Rowing | Russian Federation | 2004 | 97 | N |
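Several cells in this table are empty. For example:
- Row 2 has no value in the "Name" column.
ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete's weight | Medal |
---|---|---|---|---|---|---|
2 | | Boxing | Great Britain | 1986 | 54 | S |
- Row 4 has no country and no weight.
ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete's weight | Medal |
---|---|---|---|---|---|---|
4 | Oleg | Greco-Roman wrestling | | 1984 | | B |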
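Gaps like these can also be located programmatically. The following is a minimal sketch that assumes the table is held as plain Python lists with `None` in place of an empty cell; the names `COLUMNS`, `ROWS`, and `find_missing` are illustrative and not part of the original article.

```python
# Hypothetical in-memory copy of the example table; None marks an empty cell.
COLUMNS = ["ID", "Name", "Sports discipline", "Country",
           "Year of birth", "Weight", "Medal"]
ROWS = [
    [1, "Ivan", "Rowing", "Russian Federation", 1985, 265, "B"],
    [2, None, "Boxing", "Great Britain", 1986, 54, "S"],
    [3, "Kim", "Greco-Roman wrestling", "North Korea", 1986, 93, "G"],
    [4, "Oleg", "Greco-Roman wrestling", None, 1984, None, "B"],
    [5, "Pedro", "Rowing", "Brazil", None, 97, "N"],
    [6, "Valery", "Rowing", "Russian Federation", 2004, 97, "N"],
]


def find_missing(columns, rows):
    """Map each row ID to the names of the columns that are empty in that row."""
    missing = {}
    for row in rows:
        empty = [name for name, value in zip(columns, row) if value is None]
        if empty:
            missing[row[0]] = empty  # row[0] is the ID column
    return missing


missing = find_missing(COLUMNS, ROWS)
for row_id, names in missing.items():
    print(f"row {row_id}: missing {', '.join(names)}")
```

Rows without gaps are simply left out of the result, so an empty dictionary means the table is complete.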
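A missing number, such as the weight in row 4, can be filled in, for example:
- With the mean of the known values.
ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete's weight | Medal |
---|---|---|---|---|---|---|
4 | Oleg | Greco-Roman wrestling | | 1984 | (265 + 54 + 93 + 97 + 97) / 5 = 121.2 | B |
Note that the mean is pulled strongly upward by the "Athlete's weight" value in row 1.
- With the median. Unlike the mean, it is barely affected by a single extreme value.
ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete's weight | Medal |
---|---|---|---|---|---|---|
4 | Oleg | Greco-Roman wrestling | | 1984 | median(54, 93, 97, 97, 265) = 97 | B |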
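Both ways of filling the gap can be sketched in a few lines using only Python's standard `statistics` module. The dictionary `weights` and the helper `fill_gaps` are hypothetical names introduced here for illustration; the numbers are the weights from the example table.

```python
from statistics import mean, median

# Weights from the example table keyed by row ID; None marks the gap in row 4.
weights = {1: 265, 2: 54, 3: 93, 4: None, 5: 97, 6: 97}


def fill_gaps(values, strategy):
    """Return a copy of `values` with every None replaced by strategy(known values)."""
    known = [v for v in values.values() if v is not None]
    fill_value = strategy(known)
    return {key: fill_value if v is None else v for key, v in values.items()}


filled_with_mean = fill_gaps(weights, mean)
filled_with_median = fill_gaps(weights, median)

print(filled_with_mean[4])    # 121.2
print(filled_with_median[4])  # 97
```

Passing the statistic as a function keeps the helper independent of the chosen strategy, so another rule could be swapped in without changing it.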
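Now look at row 1. A weight of 265 is far from the other known weights, which lie between 54 and 97. Values like this are called outliers.
ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete's weight | Medal |
---|---|---|---|---|---|---|
1 | Ivan | Rowing | Russian Federation | 1985 | 265 | B |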
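One common way to detect outliers uses the interquartile range:
$$ IQR = Q_3 - Q_1 $$
$Q_1$ is the first quartile, the value below which 25% of the observations lie. $Q_3$ is the third quartile, the value below which 75% of the observations lie.
A value $x$ is then usually considered normal if it falls inside the following range, and an outlier otherwise:
$$ Q_1 - 1.5 \cdot IQR \le x \le Q_3 + 1.5 \cdot IQR $$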
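The quartile-based check can be tried on the weights from the example table. This is a rough sketch under two assumptions that are not taken from the article: the conventional multiplier of 1.5, and the `inclusive` quartile method of Python's `statistics.quantiles`. The function name `iqr_outliers` is illustrative.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates between the observed minimum and maximum;
    # other quartile conventions give slightly different cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


weights = [265, 54, 93, 97, 97]  # known weights from the example table
outliers = iqr_outliers(weights)
print(outliers)  # [265, 54]
```

With only five values the quartiles come out as 93 and 97, so the range is very narrow and 54 is flagged along with 265. On a sample this small the rule is an illustration rather than a reliable test.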
Normalization rescales the values of a feature to a common range, usually [0, 1].
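As an illustration, here is a minimal sketch of rescaling to [0, 1], assuming the usual min-max formula `(x - min) / (max - min)`. The formula and the name `min_max_normalize` are assumptions made for this example; the handling of a constant column is an arbitrary choice.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; returning zeros is an arbitrary choice made here.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


weights = [265, 54, 93, 97, 97]  # known weights from the example table
normalized = min_max_normalize(weights)
print([round(x, 3) for x in normalized])
```

Because the result depends on the minimum and maximum, a single extreme value such as 265 squeezes all the other weights into a small part of the range.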
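Z-score standardization. The Z-score is computed as:
$$ z = \frac{X - M[X]}{\sigma[X]} $$
$\sigma[X]$ is the standard deviation of X.
$M[X]$ is the expected value (mean) of X.
After this transformation the feature has a mean of 0 and a standard deviation of 1.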
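The Z-score can be computed with the standard library alone. The sketch below assumes the population standard deviation (`statistics.pstdev`); using the sample version instead would be an equally valid convention. The name `z_scores` is illustrative.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation around mu
    if sigma == 0:
        raise ValueError("all values are equal, so the Z-score is undefined")
    return [(v - mu) / sigma for v in values]


weights = [265, 54, 93, 97, 97]  # known weights from the example table
standardized = z_scores(weights)
print([round(z, 2) for z in standardized])
```

The result has a mean of 0 and a standard deviation of 1, up to floating-point rounding.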
One-hot encoding
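Models work with numbers, but a column such as "Sports discipline" holds categories rather than quantities.
What if we simply number the categories, so that one country becomes 1, another becomes 2, and so on? The model would then read an order and a distance into those numbers that the categories themselves do not have.
One-hot encoding avoids this: each unique value gets its own column, which contains 1 if the row has that value and 0 otherwise. For example, "Country" has 4 unique values: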
ID | Country_Russian Federation | Country_Great Britain | Country_North Korea | Country_Brazil |
---|---|---|---|---|
1 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 1 | 0 |
4 | 1 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 1 |
6 | 1 | 0 | 0 | 0 |
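The same encoding can be sketched in plain Python. The function `one_hot` and the `Country_` column prefix are illustrative names. The country of row 4 is taken to be "Russian Federation" here only because that is what the one-hot table shows; the original dataset leaves it empty.

```python
def one_hot(values, prefix):
    """Build one 0/1 column per unique value, in order of first appearance."""
    categories = list(dict.fromkeys(values))  # unique values, order preserved
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


# Countries for rows 1 to 6; the value for row 4 is an assumption (see above).
countries = ["Russian Federation", "Great Britain", "North Korea",
             "Russian Federation", "Brazil", "Russian Federation"]

columns, rows = one_hot(countries, "Country")
print(columns)
for row_id, row in enumerate(rows, start=1):
    print(row_id, row)
```

Each row contains exactly one 1, and the number of new columns equals the number of unique values, which is worth keeping in mind for columns with many categories.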
Thank you for reading, or at least scrolling, this far. I have not covered every preprocessing method, and this article is unlikely to be useful to professional data scientists. However, if you are a beginner and do not know what to do with your data, feel free to come back here. Good luck with your learning, and may your tasks be interesting!
List of sources
I am not a scientist and this article does not claim to be scientific, so I will not format the sources according to the GOST standards. Please forgive me for that.