I once received an offer from Deliveroo, where I was supposed to become a Data Science Manager. While I was preparing to take up my duties, the offer was withdrawn. At the time, I had no financial cushion in case of prolonged unemployment. I will share with you everything that ultimately helped me get two offers for a Data Scientist position from Facebook at once. I hope this helps some of you get out of the difficult situation I found myself in a few months ago.
1. Organization is the key to everything
I interviewed at Google (and DeepMind), Uber, Facebook, and Amazon for everything that was in any way connected with the Data Scientist role. Most of the time I was asked questions from areas such as:
- Software development
- Applied statistics
- Machine learning
- Data processing, management and visualization
Nobody expects you to be a pro in all of these areas, but you must understand them well enough to convince the interviewer of your competence and your fit for the position. How deeply you have to understand each topic depends on the job itself, but since this is a very competitive field, any knowledge will come in handy.
I recommend using Notion to organize your interview preparation. It is a versatile tool, and it lets you apply techniques such as spaced repetition and active recall, which reinforce learning and surface the key questions that come up over and over again in Data Scientist interviews. Ali Abdaal has a great guide to taking notes with Notion that will help you get the most out of it during preparation.
I reviewed my notes in Notion constantly, and especially intensively just before each interview. This made me confident that the key topics and terms were in my "working memory", so I didn't waste precious time mumbling "uhhh" after every question.
2. Software development
You won't always be asked about the time complexity of an algorithm, but for any Data Scientist job you have to write code. Data Science is not one profession but many, and the industry attracts talent from a variety of areas, including software development. Accordingly, you will be competing with programmers who understand the nuances of writing efficient code. I would recommend spending 1-2 hours a day before the interview mastering and/or strengthening your knowledge of the following topics:
- Arrays.
- Hash tables.
- Linked Lists.
- Two-pointer technique.
- String algorithms (employers LOVE this topic).
- Binary search.
- Divide and conquer algorithms.
- Sorting algorithms.
- Dynamic programming.
- Recursion.
Don't study algorithms in a purely formal way; that is useless, because the interviewer may ask about the nuances of a particular algorithm and you will get lost. Instead, master the fundamentals of how each algorithm works. Study time and space complexity and understand why they matter for writing quality code.
Interviewers have plenty to ask about algorithms, so it's worth learning the basics and the common problem patterns to make it easier to respond in interviews later.
Try to answer every practice question yourself, even if it takes a long time. Then look at the model solution and try to work out the optimal strategy. Then study the answer and try to understand why it is what it is. Ask yourself questions like "why is the worst-case time complexity of Quicksort O(n²)?" or "why do two pointers and one for loop make more sense than three for loops?"
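To make the last question concrete, here is a minimal Python sketch of the two-pointer technique; the pair-sum task is a made-up illustration, not a question from any specific interview:

```python
# Find whether any pair in a sorted list sums to a target.
# Nested for loops would be O(n^2); two pointers bring it down to O(n).

def has_pair_with_sum(sorted_nums: list[int], target: int) -> bool:
    """Return True if two distinct elements of sorted_nums sum to target."""
    left, right = 0, len(sorted_nums) - 1
    while left < right:
        current = sorted_nums[left] + sorted_nums[right]
        if current == target:
            return True
        if current < target:
            left += 1   # need a larger sum: advance the left pointer
        else:
            right -= 1  # need a smaller sum: retreat the right pointer
    return False

print(has_pair_with_sum([1, 3, 5, 8, 11], 13))  # True (5 + 8)
print(has_pair_with_sum([1, 3, 5, 8, 11], 20))  # False
```

The sortedness of the input is what makes a single pass with two pointers sufficient, and that is exactly the kind of reasoning interviewers want to hear out loud.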
3. Applied statistics
Applied statistics plays an important role in Data Science; how important depends on the position you are applying for. Where is applied statistics actually used? Anywhere you need to organize, interpret, and extract information from data.
During interviews, I advise you to carefully study the following topics:
- Descriptive statistics (distributions, measures of central tendency, variance, correlation).
- Probability theory (conditional probability, Bayes' theorem, common distributions).
- Hypothesis testing (A/B tests, T-tests, p-values, confidence intervals, etc.).
- Regression analysis (linear regression and its assumptions).
- Bayesian vs. frequentist inference.
If you think this is a huge amount of material to study, you are not wrong. I was amazed at how much can be asked in an interview, and also at how much you can find online to help you prepare. Two resources helped me cope:
- Introduction to Probability and Statistics is a free course that covers everything described above, including questions and a self-test exam.
- Machine Learning: A Bayesian and Optimization Perspective. This is more of a machine learning course than an applied statistics one, but the linear algebra approaches described in it help in understanding the nuances of regression analysis.
It's best not to learn all this by rote; instead, solve as many problems as you can. Glassdoor is a great repository of the applied statistics questions that usually come up in interviews. The most difficult interview I had was at G-Research, but I really enjoyed preparing for it, and Glassdoor helped me understand how far I had progressed in mastering this topic.
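To make this concrete, here is a minimal A/B-test sketch with a two-sample t-test, the kind of exercise those question banks tend to feature; the data is simulated purely for illustration:

```python
# A toy A/B test: compare a metric between a control and a treatment group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # variant A (simulated)
treatment = rng.normal(loc=10.3, scale=2.0, size=500)  # variant B with a small lift

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the difference in means is statistically significant.")
else:
    print("Fail to reject H0: no significant difference detected.")
```

Being able to explain each step (what H0 is, why Welch's variant, what the p-value does and does not mean) matters more than running the code.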
4. Machine learning
Now we come to the most important thing - machine learning. But this topic is so vast that you can simply get lost in it.
Below are some resources that will give you a very solid foundation to get started with machine learning, along with a far-from-exhaustive set of topics, grouped by theme.
Metrics - classification
- Confusion matrices, accuracy, precision, recall (sensitivity)
- F1-score
- TPR, TNR, FPR, FNR
- Type I and Type II errors
- AUC-ROC
Bias-variance tradeoff, over/under-fitting
Sampling
Hypothesis Testing
This topic is more about applied statistics, but it is extremely important, in particular for A/B testing.
Regression Models
There is a wealth of information available about linear regression, so you should also familiarize yourself with other regression models (a short comparison sketch follows the list):
- Deep neural networks for regression problems
- Random forest regression
- XGBoost Regression
- ARIMA/SARIMA
- Bayesian Linear Regression
- Gaussian process regression
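As promised above, here is a short sketch comparing two of these regression models with cross-validation; it is a hedged illustration using scikit-learn on synthetic data, with made-up hyperparameters:

```python
# Compare a linear model to a random forest on a synthetic regression task.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{model.__class__.__name__}: mean R^2 = {scores.mean():.3f}")
```

On a linearly generated dataset like this one, the linear model should win, which is a useful talking point about matching model complexity to the data.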
Clustering algorithms
Classification models
- Logistic regression (the most important one; learn it thoroughly)
- Multiple regression
- XGBoost
- Support vector machine
That's a lot, but it looks less scary if you understand applied statistics. I would recommend learning the nuances of at least three different classification/regression/clustering methods, because the interviewer can always ask (and does): "What other methods could we use, and what are their advantages/disadvantages?" This is only a fraction of the required knowledge, but if you know these important examples, interviews will go much more smoothly.
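Here is a minimal sketch of that kind of side-by-side comparison, assuming scikit-learn and using its GradientBoostingClassifier as a stand-in for XGBoost; the synthetic dataset is purely illustrative:

```python
# Cross-validate three classifiers on the same synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(),
    "Gradient boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {acc:.3f}")
```

The numbers matter less than being able to say why one model might beat another: linear decision boundary vs. kernels vs. boosted trees, training cost, and interpretability.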
5. Data processing and visualization
"Tell us about the stages of data processing and cleaning before applying machine learning algorithms."
You will be given a specific dataset. The first and foremost task is to show that you can perform EDA (exploratory data analysis). Pandas, used correctly, is the most powerful tool in the data analysis toolbox, and the best way to learn how to use it for data processing is to download many, many datasets and work with them.
In one of my interviews I had to load a dataset, clean it, visualize it, select features, and then build and evaluate a model, all in one hour. It was really crazy, and it was really hard. But I had been practicing exactly this for a few weeks, so I knew what to do even when I lost the thread.
Organizing Data
There are three certainties in life: death, taxes, and getting a request to merge datasets. Pandas is almost perfect for the job, so please practice, practice, practice.
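A toy merge, assuming two hypothetical tables that share a user_id key:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["UK", "US", "DE"]})
orders = pd.DataFrame({"user_id": [1, 1, 2, 4], "amount": [9.99, 4.50, 20.00, 7.25]})

# A left join keeps every user (user 3 gets NaN amounts, user 4 disappears);
# how="inner" or how="outer" change which keys survive.
merged = users.merge(orders, on="user_id", how="left")
print(merged)
```

Knowing which join type answers which business question is usually what the interviewer is really probing.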
Data profiling
This task involves understanding the "meta" characteristics of the dataset, such as its shape and descriptions of its numeric, categorical, and temporal features. You should always strive to answer a series of questions like "how many observations do I have?", "what does the distribution of each feature look like?", and "what do these features mean?". This kind of early profiling can help you discard irrelevant features from the start, such as categorical features with thousands of levels (names, unique identifiers), and reduce the amount of work for you and your computer down the road (work smarter, not harder).
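A quick profiling sketch; the file name is hypothetical, and everything else is standard Pandas:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file

print(df.shape)                    # how many observations and features?
print(df.dtypes)                   # numeric, categorical, or datetime?
print(df.describe(include="all"))  # distributions and levels at a glance
print(df.nunique().sort_values(ascending=False).head())  # high-cardinality columns
# Categorical features with thousands of levels (IDs, names) are early drop candidates.
```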
Data visualization
Here you ask yourself: "What do the distributions of my features actually look like?" Quick tip: if you didn't learn about box plots in the applied statistics part of your preparation, now is the time, because you need to learn how to identify outliers visually. Histograms and kernel density plots are extremely useful tools for inspecting the distribution of each feature.
Then you might ask "what do the relationships between my features look like?", in which case Python has a package called seaborn that contains cool and powerful tools like pairplot and a nice heatmap for correlation plots.
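A short sketch of those seaborn tools on a small demo dataset (seaborn's bundled iris data, chosen only for illustration):

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("iris")  # any small tabular dataset works here

sns.pairplot(df, hue="species")  # pairwise scatter plots + per-feature distributions
plt.show()

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")  # correlations
plt.show()

sns.boxplot(x=df["sepal_width"])  # box plots make outliers visible at a glance
plt.show()
```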
Handling null values, syntax errors, and duplicate rows / columns
Missing values are inevitable. They arise from many different factors, each of which introduces bias in its own way, so you need to learn how best to deal with them. Check out this guide on how to handle null values.
Syntax errors usually occur when a dataset contains information that has been entered manually, such as through a form. This can lead us to the erroneous conclusion that a categorical feature has many more levels than it actually does, because "Hot", "hOt", and "hot\n" are counted as unique levels. Check out this resource on handling dirty text data.
Finally, duplicate columns are unnecessary, and duplicate rows can bias the model's view of the data, so deal with both early.
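A cleaning sketch that covers all three issues, with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": ["Hot", "hOt", "hot\n", "Cold", None, "Cold"],
    "value": [1.0, 2.0, np.nan, 4.0, 5.0, 4.0],
})

# Syntax errors: normalize case and strip whitespace before counting levels.
df["temp"] = df["temp"].str.strip().str.lower()

# Missing values: impute (here, with the median) or drop, weighing the bias risk.
df["value"] = df["value"].fillna(df["value"].median())
df = df.dropna(subset=["temp"])

# Duplicates: drop repeated rows (and inspect df.columns for repeated columns).
df = df.drop_duplicates()
print(df)
```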
Standardization or normalization
Depending on the dataset you are working with and the machine learning method you choose to use, it may be helpful to standardize or normalize the data so that different scales of different variables do not negatively impact the performance of your model.
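A minimal sketch of both transforms on toy data, assuming scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
X_norm = MinMaxScaler().fit_transform(X)    # rescaled to [0, 1] per column
print(X_std)
print(X_norm)
# Fit scalers on the training split only, then transform the test split,
# to avoid leaking test-set statistics into the model.
```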
In the end, what helped me was not a "memorize everything" mindset so much as the realization of how much the preparation itself paid off. I failed many interviews before I understood that none of the above are esoteric concepts that only a select few can master. These are the tools Data Scientists use every day to build cool models and extract important insights from data.