Learning data science from scratch: stages and milestones

In keeping with the spiral approach to learning that I wrote about earlier, here is a step-by-step plan for mastering data science. The plan is designed so that each stage leaves the student better equipped for real-world tasks. The approach is inspired by Agile: however much time you spend studying, and whenever you stop, you will have the most in-demand set of knowledge that can be mastered in that time.





The same logic can be explained from the opposite direction: it is risky to start with neural networks right away, because a person who knows something about them but cannot apply that knowledge effectively in practice is not in demand. For example, you can spend 300 hours acquiring knowledge that neither makes you a good enough specialist to solve real problems nor lets you land an entry-level job where you could keep developing those skills.





If you follow the proposed plan, each stage increases your chances of finding such a job, so the skills needed for further growth can also be developed in the course of real work.





For each stage, I give an approximate time estimate, assuming a reasonably effective approach and studying only the minimum required material (with an ineffective approach, each item can take ten times longer to learn).





Later, in a separate article, I will list good courses and books for each stage (some courses and books for the first stages were already mentioned in the first article of this series).





Stage 1. Basic data analysis tools: SQL, Excel

  • SQL basics (20h). Knowing SQL is useful in its own right for plenty of other tasks. In any case, it is required for a large share (most?) of openings for analysts, data scientists and, even more so, machine learning engineers.





  • Excel basics (10h): filtering and sorting data, formulas, VLOOKUP, pivot tables, basic work with charts. Colleagues, partners, or management will send you input data in Excel, and you will need to be able to understand and explore it quickly. It is also often more convenient to prepare and present the results of an analysis done in Python in Excel.





  • (20-200, ), .. , pandas/scikit, Python .
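The SQL and Excel skills listed above overlap more than they might seem to. As a rough illustration, here is a minimal sketch using Python's built-in sqlite3 module, in which the table names, column names, and values are all invented for the example. It expresses in SQL the same operations the Excel bullet names: filtering, sorting, a lookup (JOIN in place of VLOOKUP), and a pivot-style summary (GROUP BY).

```python
import sqlite3

# Hypothetical sample data: tables, columns and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'North'), (2, 'South'), (3, 'North');
    INSERT INTO orders VALUES
        (1, 1, 120.0), (2, 1, 80.0), (3, 2, 200.0), (4, 3, 50.0), (5, 2, 30.0);
""")

# Filter + sort: orders of at least 80, largest first.
large_orders = conn.execute(
    "SELECT id, amount FROM orders WHERE amount >= 80 ORDER BY amount DESC"
).fetchall()

# Lookup + pivot: JOIN plays the role of VLOOKUP, GROUP BY the role of a pivot table.
revenue_by_region = conn.execute("""
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

print(large_orders)
print(revenue_by_region)
conn.close()
```

An in-memory SQLite database is used only so that the sketch runs without any setup; the same queries would look nearly identical in other SQL dialects.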





: / / -. , , 100, 50-70 .





Stage 2. Python and pandas

  • Python (80h).





  • pandas (20 ) - . : , , ,





  • APIs (requests, Beautiful Soup)





Python API, -.





, . , -, ( ). , , .





Stage 3.

( 200-400 , )





  :





  • -





  • Overfitting









  • Data leakage





  • ( )









, :





  • :

















    • Random forest









    • kNN





  • : k-means





  • :





  • : PCA
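Of the algorithms named in the list above, kNN is simple enough to write out in full, which can help build intuition before moving on to a library. The following is a minimal sketch in plain Python, assuming Euclidean distance and majority voting; the data points and labels are invented for illustration. In real work a ready-made implementation such as scikit-learn's KNeighborsClassifier would normally be used instead.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to `point` with its label.
    distances = sorted(
        (math.dist(x, point), label) for x, label in zip(train_X, train_y)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-cluster toy data.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(train_X, train_y, (1.1, 1.0))
pred_near_b = knn_predict(train_X, train_y, (5.1, 5.0))
print(pred_near_a, pred_near_b)
```

The choice of k is the main knob here: a very small k follows the training data closely, while a larger k smooths the decision boundary.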





: dummy , one-hot encoding, tf-idf
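One-hot encoding, mentioned above, turns a categorical column into one 0/1 column per category. Here is a minimal sketch in plain Python; the function name and the sample values are invented for illustration. Library equivalents exist, for example pandas.get_dummies and scikit-learn's OneHotEncoder.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one "hot" position per row
        rows.append(row)
    return categories, rows


# Hypothetical categorical feature.
colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Sorting the categories keeps the column order deterministic, which matters when the same encoding has to be applied to training and test data.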









:





  • : , ( ).





  • : "correlation does not imply causation", .





  • ., ,   : . (max likelihood), (log-likelihood). ( log log-odds), ( ""). , , . , . . , senior, :





. - (/) .





Stage 4.

- scikit-learn, pandas (numpy).





, . 100-300. - , .





feature engineering





junior data scientist. . . senior , .





, CNN, RNN/LSTM , vector embeddings. , . " " , , , , , .





, .





20-40, .





Stage 5.

60-200, . , , , .. ,





  • Conda, , conda





  • bash





  • Python standard library, ( itertools, collections, contextlib), , ; context managers.





  • git, IDE: pycharm/vs code. git,





  • (matplotlib+seaborn, plotnine, plotly), .
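The list above names itertools, collections, contextlib, and context managers from the Python standard library. As a rough illustration of why they are worth learning, here is a minimal sketch; the `timer` helper and the sample documents are invented for the example.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager
from itertools import chain, islice


@contextmanager
def timer(label, results):
    """Context manager that stores elapsed seconds under `label` in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the body raises, like any context manager's cleanup.
        results[label] = time.perf_counter() - start


# Hypothetical tokenized documents.
docs = [["data", "science", "data"], ["python", "data"], ["science"]]

timings = {}
with timer("count", timings):
    # chain.from_iterable flattens the nested lists lazily; Counter tallies tokens.
    counts = Counter(chain.from_iterable(docs))

# defaultdict removes the "is the key already there?" check when grouping.
docs_by_length = defaultdict(list)
for doc in docs:
    docs_by_length[len(doc)].append(doc)

# islice takes the first n items of any iterable without building a full list.
first_two_tokens = list(islice(chain.from_iterable(docs), 2))
top_token = counts.most_common(1)

print(top_token, first_two_tokens, dict(docs_by_length))
```

The same `with` pattern is what file handles and database connections use, so a small hand-written context manager is a cheap way to understand it.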





( -, ).





, , , , (feature engineering), , , (xgboost, cat-boost). . Senior .





, 2-5

- , . , , . ( ):





  • matplotlib – , . , , -. , , "" - .





  • seaborn - , . .





  • plotnine - . - , . - seaborn , matplotlib , . , , plotnine . plotly - .





  • plotly - . . , , ().





, 10-20 .





, - PowerBI Tableau, . , , 60. SQL+Excel+PowerBI/Tableau "BI-" c 100 ., 150 . .





, ,





  • regular expressions, aka RegExp (10). regexp .





  • PySpark (40 , 100-200 ) . , (). Big data. , .. . ( , ).





    Spark , , , .. SQL , , API pandas. , . Koalas, pandas spark-, Spark.





  • html - , , , .
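For the regular expressions item in the list above, a short sketch with Python's built-in re module may make the idea concrete. The log format and field names are invented for illustration; the sketch extracts structured fields from text and normalizes whitespace.

```python
import re

# Hypothetical log lines.
log = """2021-03-01 ERROR user=alice code=500
2021-03-01 INFO user=bob code=200
2021-03-02 ERROR user=carol code=404"""

# Named groups make the extracted fields self-describing;
# re.MULTILINE lets ^ and $ match at every line boundary.
pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) ERROR user=(?P<user>\w+) code=(?P<code>\d{3})$",
    re.MULTILINE,
)
errors = [match.groupdict() for match in pattern.finditer(log)]

# re.sub collapses any run of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", "  too   many\tspaces ").strip()

print(errors)
print(cleaned)
```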





Stage 6.

, , , .





  • Python :  , , , dunderscore ____ .





  • bash, linux





  • docker





  . , , . .. , .





- (, EDA ). , . , . , , .. . , : , , .





, . , .. . , "" .





:  . 50, , , .





, , data science









, ,









, ( 2 ) , , .





, , , , . , , , . , .





, , , .





self.development.mentor in the gmail.com domain, Oleg 







