Table of contents:
Introduction
1. Mathematics and statistics
2. Fundamentals of programming
3. Algorithms and concepts of machine learning
4. Projects in the field of data science
Introduction
My guess is that as a budding data scientist, you will want to fully understand the concepts and details of various machine learning algorithms, data science concepts, and so on.
Therefore, I recommend that you start with the fundamentals before you even look at machine learning algorithms or data analysis applications. If you do not have a basic understanding of calculus (integrals in particular), linear algebra, and statistics, it will be difficult to understand the underlying mechanics of the various algorithms. Likewise, if you don't have a basic understanding of Python, it will be difficult to translate your knowledge into real-world applications. Below is the order in which I recommend studying the topics:
- Mathematics and statistics.
- Basics of programming.
- Machine learning algorithms and concepts.
1. Mathematics and statistics
As with everything else, you should learn the basics before getting into the fun stuff. Trust me, it would have been much easier for me if I had started with math and statistics before jumping into machine learning algorithms. The three general topics that I recommend looking at are calculus (integrals), statistics, and linear algebra (in no particular order).
Integrals
Integrals are important when it comes to probability distributions and hypothesis testing. While you don't need to be an expert, it's in your best interest to learn the basics of integrals. The first two articles below are intended for those who want to get an idea of what integrals are, or who just need to brush up on their knowledge. If you know absolutely nothing about integrals, I recommend that you take the Khan Academy course. Finally, the last link leads to a set of practice problems to hone your skills:
- Introduction to integrals (article).
- A crash course on integrals (article).
- Khan Academy: Integral Calculus (course).
- Practice questions (start with unit 6).
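To make the link between integrals and probability concrete, here is a minimal sketch in Python. It assumes a standard normal distribution and uses a simple trapezoidal rule; the function names `normal_pdf` and `integrate` are my own illustrations, not taken from any of the resources above. The probability that a value falls inside an interval is the area under the density curve over that interval:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# Probability that a standard normal variable lies within 1.96 standard
# deviations of the mean: the area under the density between -1.96 and 1.96.
prob = integrate(normal_pdf, -1.96, 1.96)
print(round(prob, 3))  # 0.95
```

This is the same 95% that shows up in confidence intervals and two-sided hypothesis tests, which is why a little integral calculus goes a long way in statistics.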
Statistics
If there is any topic that you should focus on, it is statistics. After all, a data scientist is really a modern statistician, and machine learning is a modern term for statistics. If you have time, I recommend that you take the Georgia Tech course entitled Statistical Techniques, which covers the basics of probability, random variables, probability distributions, hypothesis testing, and more. If you don't have time to devote to this course, I highly recommend watching the Khan Academy videos on statistics.
Linear algebra
Linear algebra is especially important if you want to dive into deep learning, but it is also useful for other fundamental machine learning concepts such as principal component analysis and recommender systems. For mastering linear algebra, I also recommend Khan Academy!
2. Fundamentals of programming
Just as a fundamental understanding of math and statistics is important, a fundamental understanding of programming will make your life so much easier, especially when it comes to implementation. Therefore, I recommend that you take the time to learn two basic languages, SQL and Python, before diving into machine learning algorithms.
SQL
It doesn't matter where you start, but I would start with SQL. Why? It is easier to learn, and it is useful to know if you work at a company that deals with data, even if you are not a data scientist.
If you are new to SQL, I recommend checking out Mode's SQL tutorials, as they are very concise and detailed. If you want to learn more advanced concepts, check out the list of resources where you can learn advanced SQL.
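If you want to try SQL without installing a database server, one option is Python's built-in `sqlite3` module. The sketch below is only an illustration: the `orders` table and its rows are made up, and the query shows the kind of filtering and aggregation that introductory SQL tutorials typically cover.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

SQLite's dialect differs in places from the databases you will meet at work, but `SELECT`, `GROUP BY`, and `ORDER BY` behave the same way for a query like this one.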
Below are a few resources that you can use to practice SQL:
Python
I started out with Python and will probably stay with this language for the rest of my life. It is far ahead in terms of open-source contributions and is easy to learn. Feel free to turn to R if you want, but I have no opinions or advice on R. I have found that learning Python through practice is much more rewarding. Nevertheless, after taking several Python crash courses, I came to the conclusion that this course is the most complete (and free!).
Pandas
Perhaps the most important library to know is Pandas, which is specifically designed for data manipulation and analysis. Below are two resources that should speed up your learning. The first link is a tutorial on how to use Pandas, and the second contains many practice problems that you can solve to solidify your knowledge!
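To give a feel for what data manipulation with Pandas looks like, here is a small sketch. The wine data is invented for illustration and is not taken from either resource; the example filters rows and then aggregates by group, two operations you will use constantly.

```python
import pandas as pd

# Hypothetical data: wine samples with a quality score (made-up values).
df = pd.DataFrame(
    {
        "type": ["red", "white", "red", "white", "red"],
        "alcohol": [9.4, 10.1, 11.2, 9.8, 10.0],
        "quality": [5, 6, 7, 5, 6],
    }
)

# Filter rows, then aggregate: mean quality per wine type for samples
# with at least 9.5% alcohol.
strong = df[df["alcohol"] >= 9.5]
mean_quality = strong.groupby("type")["quality"].mean()
print(mean_quality)
```

The boolean mask `df["alcohol"] >= 9.5` plays the role of a SQL `WHERE` clause, and `groupby(...).mean()` corresponds to `GROUP BY` with `AVG`.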
3. Algorithms and concepts of machine learning
If you've gotten to this part of the article, it means you've built your foundation and are ready to learn the interesting things. This part is split into two sections: machine learning algorithms and machine learning concepts.
Machine learning algorithms
The next step is to learn about the various machine learning algorithms, how they work and when to use them. Below is a partial list of the various machine learning algorithms and resources that you can use to learn each of them.
- Linear regression (Georgia Tech, StatQuest).
- Logistic regression (StatQuest).
- K-nearest neighbors (MIT).
- Decision trees (StatQuest).
- Naive Bayes (Terence Sheen, Luis Serrano).
- Support vector machines (SVM Tutorial by Alice Zhao).
- Neural networks (Terence Sheen).
- Random forests (StatQuest).
- AdaBoost (Terence Sheen, StatQuest).
- Gradient boosting (StatQuest).
- XGBoost (StatQuest).
- Principal component analysis (StatQuest).
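To show how approachable some of these algorithms are once you have the basics, here is a minimal sketch of a k-nearest neighbors classifier in plain Python. It is an illustration only, not the implementation from any of the linked resources: the function name `knn_predict` and the toy data are my own, and it assumes Euclidean distance with a simple majority vote.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.8, 1.6)))  # a
print(knn_predict(points, labels, (6.2, 6.4)))  # b
```

In practice you would reach for a library implementation, but writing an algorithm once by hand is a good way to check that you understand when it works and when it does not.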
Machine learning concepts
In addition, there are a few fundamental machine learning concepts that you will want to learn. Below is a (non-exhaustive) list of concepts that I highly recommend learning. Many interview questions are based on these topics!
- Regularization.
- The bias-variance tradeoff.
- Confusion matrix and related metrics.
- ROC curve and area under the curve (AUC) (video).
- Bootstrap sampling.
- Ensemble learning, bagging and boosting.
- Normalization and standardization.
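Two of these concepts are easy to try out in a few lines of Python. The sketch below is a rough illustration with made-up labels and values, and the helper names are my own. It counts the four cells of a binary confusion matrix and derives precision, recall, and accuracy from them, then contrasts min-max normalization (rescaling to the range 0 to 1) with standardization (rescaling to mean 0 and standard deviation 1).

```python
import statistics

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the values are not all equal
    return [(v - mean) / std for v in values]

# Made-up true labels and predictions for eight samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)
print(tp, fp, fn, tn)               # 3 1 1 3
print(precision, recall, accuracy)  # 0.75 0.75 0.75

# Made-up feature values.
feature = [2.0, 4.0, 6.0, 10.0]
normalized = min_max_normalize(feature)
standardized = standardize(feature)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

Normalization keeps the values inside a fixed range, while standardization centers them and measures each value in standard deviations, which is often the better choice when a feature has outliers.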
4. Projects in the field of data science
By this point, you will not only have built a solid foundation, but you will also have a solid understanding of the fundamentals of machine learning. Now it's time to work on some personal side projects. If you want to see some simple examples of data science projects, check out some of my projects:
- Predicting Wine Quality Using Classification Methods (article, GitHub).
- Visualizing Coronavirus Data with Plotly (article, GitHub).
- Movie Recommendation System with Collaborative Filtering (GitHub).
Here is a list of Data Science projects you can look at to come up with an interesting side project.
I hope this post will give you direction and help in your career in Data Science. There is no silver bullet, so feel free to take this post with a grain of salt, but I do believe that learning the basics will pay off in the future.