Data preprocessing

Hello! I am a web developer and have been interested in machine learning for several years. My day-to-day work involves problems that are unrelated to machine learning and less interesting to me, so from time to time I forget what I once read or used. To create a memo for myself, strengthen my knowledge, and share it with others, I decided to write this series of articles on machine learning. I'll start with data preprocessing.



In this article I will talk about the problems that occur in data and how to solve them, and about the most commonly used methods of preparing data before feeding it to a model.



Missing values



Consider the following dataset. I made it up myself and will refer to it throughout this article.



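| ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete weight | Medal |
|----|------|-------------------|---------|-------------------------|----------------|-------|
| 1 | Ivan | Rowing | Russian Federation | 1985 | 265 | B |
| 2 | | Boxing | Great Britain | 1986 | 54 | S |
| 3 | Kim | Greco-Roman wrestling | North Korea | 1986 | 93 | G |
| 4 | Oleg | Greco-Roman wrestling | | 1984 | | B |
| 5 | Pedro | Rowing | Brazil | | 97 | N |
| 6 | Valery | Rowing | Russian Federation | 2004 | 97 | N |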
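As a minimal sketch, the dataset above can be written down in plain Python and scanned for gaps. The column keys (`name`, `birth_year`, and so on) and the helper `count_missing` are names invented for this illustration, and `None` is assumed to mark an empty cell.

```python
# Toy dataset from the table above; None marks a missing cell.
# Column keys are illustrative names, not part of the original article.
COLUMNS = ["id", "name", "sport", "country", "birth_year", "weight", "medal"]
ROWS = [
    (1, "Ivan", "Rowing", "Russian Federation", 1985, 265, "B"),
    (2, None, "Boxing", "Great Britain", 1986, 54, "S"),
    (3, "Kim", "Greco-Roman wrestling", "North Korea", 1986, 93, "G"),
    (4, "Oleg", "Greco-Roman wrestling", None, 1984, None, "B"),
    (5, "Pedro", "Rowing", "Brazil", None, 97, "N"),
    (6, "Valery", "Rowing", "Russian Federation", 2004, 97, "N"),
]


def count_missing(rows, columns):
    """Return {column: number of missing cells}, only for columns with gaps."""
    counts = {}
    for row in rows:
        for column, value in zip(columns, row):
            if value is None:
                counts[column] = counts.get(column, 0) + 1
    return counts


missing = count_missing(ROWS, COLUMNS)
print(missing)
```

Four columns have one gap each: the name of athlete 2, the country and weight of athlete 4, and the year of birth of athlete 5.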











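A missing numeric value can be filled in:



  • With the mean of the feature.


| ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete weight | Medal |
|----|------|-------------------|---------|-------------------------|----------------|-------|
| 4 | Oleg | Greco-Roman wrestling | | 1984 | (265 + 54 + 93 + 97 + 97) / 5 = 121.2 | B |


Note that the "Athlete weight" of athlete 1 strongly affects the mean.



  • With the median of the feature, which is much less sensitive to a single extreme value.


| ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete weight | Medal |
|----|------|-------------------|---------|-------------------------|----------------|-------|
| 4 | Oleg | Greco-Roman wrestling | | 1984 | median(54, 93, 97, 97, 265) = 97 | B |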
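The two fill-in strategies above can be sketched with Python's standard `statistics` module. The helper name `fill_missing` and the use of `None` for an empty cell are assumptions made for this example, which is only one possible way to do it.

```python
from statistics import mean, median


def fill_missing(values, strategy):
    """Replace None entries with the mean or median of the known values."""
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    known = [v for v in values if v is not None]
    filler = mean(known) if strategy == "mean" else median(known)
    return [filler if v is None else v for v in values]


# "Athlete weight" column for athletes 1-6; athlete 4 has no weight.
weights = [265, 54, 93, None, 97, 97]

filled_mean = fill_missing(weights, "mean")
filled_median = fill_missing(weights, "median")
print(filled_mean[3], filled_median[3])
```

With the weights from the table, the mean fill gives 121.2 and the median fill gives 97, matching the two rows above.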




The data can also contain outliers, values that differ sharply from the rest. One example is the weight of athlete 1:



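| ID | Name | Sports discipline | Country | Athlete's year of birth | Athlete weight | Medal |
|----|------|-------------------|---------|-------------------------|----------------|-------|
| 1 | Ivan | Rowing | Russian Federation | 1985 | 265 | B |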


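Outliers can be detected with the interquartile range:



$$IQR = Q_3 - Q_1,$$



where $Q_1$ is the first quartile, the value below which 25% of the observations lie, and $Q_3$ is the third quartile, the value below which 75% of the observations lie.



Values that fall outside the interval



$$[Q_1 - 1.5 \cdot IQR,\; Q_3 + 1.5 \cdot IQR]$$



are treated as outliers.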
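The interquartile rule can be sketched as follows. The function names are invented for this example, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is an assumption: the article does not say which quartile convention it uses, and other conventions give different bounds on a sample this small.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return (lower, upper) = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    # Assumption: "inclusive" quartiles; other conventions differ slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Return the values lying outside the IQR interval, in original order."""
    lower, upper = iqr_bounds(values)
    return [v for v in values if v < lower or v > upper]


# Known weights of athletes 1, 2, 3, 5 and 6 from the dataset.
weights = [265, 54, 93, 97, 97]
bounds = iqr_bounds(weights)
outliers = find_outliers(weights)
print(bounds, outliers)
```

On these five weights the interval is [87, 103], so the rule flags 265 and also 54. With so few observations the quartiles are rough, so the result is best read as a hint about which values to inspect, not as a verdict.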





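Normalization brings the values of a feature into a fixed range, for example [0, 1].



Min-max normalization. Each value is rescaled using the minimum and maximum of the feature, so the result lies in [0, 1]: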



$$x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}$$
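A minimal sketch of this formula in plain Python. The function name `min_max_normalize` is hypothetical, and raising an error for a constant feature is a design choice of this example, not something the article prescribes.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] with (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # The formula divides by zero when every value is the same.
        raise ValueError("min-max normalization is undefined for a constant feature")
    span = x_max - x_min
    return [(x - x_min) / span for x in values]


# Known weights of athletes 1, 2, 3, 5 and 6 from the dataset.
weights = [265, 54, 93, 97, 97]
normalized = min_max_normalize(weights)
print(normalized)
```

The largest weight (265) maps to 1 and the smallest (54) maps to 0. Because of the single very large value, the other weights end up squeezed near the bottom of the range.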



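Z-score normalization. With Z-score normalization, most values fall within the range



$$(-3\sigma[X],\; 3\sigma[X]),$$



where $\sigma[X]$ is the standard deviation of $X$.



Z-score normalization is computed as



$$x_{new} = \frac{x_{old} - M[X]}{\sigma[X]},$$



where $M[X]$ is the mean (expected value) of $X$.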
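The Z-score formula can be sketched with the standard library as well. The function name is hypothetical, and using the population standard deviation (`pstdev`) for σ[X] is an assumption, since the article does not say whether the population or the sample estimate is meant.

```python
from statistics import fmean, pstdev


def z_score_normalize(values):
    """Return (x - M[X]) / sigma[X] for every x in values."""
    m = fmean(values)
    # Assumption: population standard deviation; stdev() would be the sample one.
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("Z-score normalization is undefined for a constant feature")
    return [(x - m) / sigma for x in values]


# Known weights of athletes 1, 2, 3, 5 and 6 from the dataset.
weights = [265, 54, 93, 97, 97]
z_scores = z_score_normalize(weights)
print(z_scores)
```

After the transformation the feature has a mean of 0 and a standard deviation of 1, up to floating-point error.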






One-hot encoding






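Why not simply replace the categories with numbers, for example "Russian Federation" with 1 and "Great Britain" with 2? Because a model would then treat the countries as ordered quantities, as if one were greater than another, and that order carries no meaning.



Instead, each value of the feature becomes a separate column holding 1 or 0. For example, "Country" turns into 4 columns: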



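| ID | Country_Russian Federation | Country_Great Britain | Country_North Korea | Country_Brazil |
|----|----------------------------|-----------------------|---------------------|----------------|
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 |
| 6 | 1 | 0 | 0 | 0 |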
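The encoding in the table above can be sketched in a few lines of plain Python. The function name `one_hot_encode` and the `Country_` column prefix are choices made for this example. The example also assumes that the missing country of athlete 4 has already been filled in as "Russian Federation", which is what row 4 of the table implies.

```python
def one_hot_encode(values, prefix):
    """Return (column_names, rows); categories are ordered by first appearance."""
    categories = list(dict.fromkeys(values))  # unique values, order preserved
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


countries = [
    "Russian Federation",
    "Great Britain",
    "North Korea",
    "Russian Federation",  # assumption: the gap for athlete 4 is already filled
    "Brazil",
    "Russian Federation",
]
columns, encoded = one_hot_encode(countries, "Country")
for athlete_id, row in enumerate(encoded, start=1):
    print(athlete_id, row)
```

Each row contains exactly one 1, in the column of that athlete's country, so no artificial order between countries is introduced.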





Thanks for reading, or at least scrolling, this far. I have not described every preprocessing method, and this article is unlikely to be useful to professional data scientists. However, if you are a beginner and do not know what to do with your data, feel free to come back here. Good luck with your learning, and may your tasks be interesting!



List of sources



I am not a scientist and this article does not claim to be scientific, so I have not formatted the sources according to GOST. Please excuse me for this.



  1. "Introduction to Machine Learning", a lecture course from Yandex and HSE on Coursera.
  2. "Standardization, or mean removal and variance scaling", scikit-learn library documentation.
  3. "Advanced machine learning data preparation tasks", Microsoft.


