In this article, I would like to share my experience of a scientific career in the field of Data Science, accumulated over the past year and a half.
This is my first post on Medium, so I would like to talk about myself and my previous experiences. I am a PhD student in Environmental Engineering and Computing at Harvard University, and I also work as a machine learning and blockchain consultant for the UK-based artificial intelligence consulting firm Critical Future. My research focuses on introducing machine learning and artificial intelligence into environmental science using drone-based sensory systems capable of self-propelling to map out the chemistry of the lower atmosphere, predominantly in the Amazon rainforest (for those interested in this project, I will post separate articles on this topic in the near future).
I started my PhD journey at Harvard University in the fall of 2017 with a BA and MA in mechanical engineering from Imperial College London, and completed my final year abroad at the National University of Singapore. During my undergraduate studies, I was not very familiar with Data Science and statistics in general, but at the same time I knew a lot about programming in Matlab, C and Visual Basic, and also had a strong mathematical background.
Before I started at Harvard, I had never programmed in Python or even heard of R. I had never done parallel computing or built a cluster, and machine learning and artificial intelligence were things I had mostly heard about from dystopian novels and films.
Starting a Harvard Computer Science and Machine Learning program with such a modest background was like climbing a cliff (grueling and wobbly). However, this is Harvard, so you can hardly expect anything less. The Harvard PhD program requires 10 courses, of which usually 8 are master's-level. They can be completed at your own pace, but you must finish them before graduation, which takes 5 years on average. Students are encouraged to complete all courses within the first two years, after which they can claim their (formally free) master's degree. At the end of the spring 2019 semester, I will meet these requirements and receive my diploma, after which I will focus exclusively on research.
In the fall of 2018, Harvard admitted the first-ever cohort of students to its Master's program in Data Science. It is a two-year program consisting of core courses in Data Science, Ethics, Applied Mathematics, and Computer Science, plus electives in Statistics/Economics. Having arrived a year before these students, I will be one of the first to meet the basic requirements of this program, which gives me a unique perspective on how effective a Data Science degree is.
Over the past 18 months, I have taken a number of courses. One of the first was CS205: Parallel Computing, where I first learned to program on Linux and built compute clusters capable of linear speedup on matrix computations. The course culminated in a final project that involved parallel computing in Python with Dask on a Kubernetes cluster.
I also took AM207: Advanced Scientific Computing, which is offered by the Harvard Extension School (which means anyone can take this course). This course focused on Bayesian statistics and its implementation in machine learning. It included countless hours of Markov chain Monte Carlo (MCMC) simulations, working with Bayes' theorem, and even watching a short video of Superman making time run in reverse (to demonstrate the concept of time reversibility in machine learning).
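The time reversibility mentioned above corresponds to what is usually called detailed balance: a Markov chain with transition matrix P is reversible with respect to a distribution pi when pi[i] * P[i][j] equals pi[j] * P[j][i] for every pair of states. The sketch below is not taken from the course. It is a minimal illustration that assumes a made-up four-state target distribution and a symmetric "step left or right on a ring" proposal, builds the Metropolis transition matrix, and checks the condition numerically.

```python
def metropolis_transition_matrix(target):
    """Metropolis chain on states 0..n-1 arranged in a ring.

    Proposal (an assumption for this sketch): move to the left or right
    neighbor with probability 1/2 each. A proposed move i -> j is accepted
    with probability min(1, target[j] / target[i]).
    """
    n = len(target)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):
            accept = min(1.0, target[j] / target[i])
            P[i][j] += 0.5 * accept
        # Rejected proposals keep the chain where it is.
        P[i][i] = 1.0 - sum(P[i][k] for k in range(n) if k != i)
    return P


# Hypothetical target distribution, chosen only for illustration.
target = [0.1, 0.2, 0.3, 0.4]
P = metropolis_transition_matrix(target)
n = len(target)

# Detailed balance: the probability flow i -> j equals the flow j -> i.
max_imbalance = max(
    abs(target[i] * P[i][j] - target[j] * P[j][i])
    for i in range(n)
    for j in range(n)
)

# Detailed balance implies the target is stationary: target @ P == target.
stationary_after_step = [
    sum(target[i] * P[i][j] for i in range(n)) for j in range(n)
]

print("largest detailed-balance violation:", max_imbalance)
print("target after one step:", stationary_after_step)
```

Because the flows balance pairwise, a recording of the chain at equilibrium looks statistically the same played forwards or backwards, which is the property MCMC samplers of this kind rely on to leave the target distribution unchanged.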
Another of the core courses is AC209a, which focuses on the fundamentals of Machine Learning and Data Science. I would say that this course covers what most people think of when someone says the words "Data Science" or "Machine Learning." It's about learning how to do exploratory data analysis and run regressors and classifiers using sklearn. Most of the lessons focus on understanding these techniques and how best to optimize them for a given dataset (it takes a little more than just calling model.fit(X_train, y_train) ...). Another course is AC209b: Additional Data Science Sections, which is an extension of the first class. Basically, this is a Data Science course on steroids, in which the first few lectures start with generalized additive models and creating nice splines to describe datasets. However, things quickly escalate into running 2500 models in parallel using Dask on a Kubernetes cluster in an attempt to perform hyperparameter optimization on a 100-layer artificial neural network. And that was not even the most difficult thing we did in the course as a whole: all of this happened in just the third week of lectures.
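To give a sense of what "a little more than model.fit(X_train, y_train)" can look like in practice, here is a small sketch of cross-validated hyperparameter tuning with sklearn. It is not material from the course: the iris dataset, the logistic regression model, and the grid of C values are all assumptions picked to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here purely for illustration.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling lives inside the pipeline so that each cross-validation fold
# is scaled using only its own training portion.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training set selects the best C,
# then the pipeline is refit on the full training set with that value.
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)

best_C = search.best_params_["logisticregression__C"]
test_accuracy = search.score(X_test, y_test)

print("best C:", best_C)
print("held-out accuracy:", round(test_accuracy, 3))
```

The held-out test set is touched only once, after the search has finished, so the reported accuracy is not biased by the tuning itself. Scaling the same idea up to thousands of candidate models is where tools like Dask and a cluster come in.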
I've also taken other courses, including CS181: Machine Learning, which covers the mathematical foundations of regression, classification, reinforcement learning and other areas using both frequentist and Bayesian methods; AM205: Scientific Methods for Solving Differential Equations; and AM225: Advanced Methods for Solving Partial Differential Equations. There are many other courses I could still take during my remaining time at Harvard to deepen my knowledge, such as CS207: Systems Engineering for Computational Science, AM231: Decision Theory, or AM221: Advanced Optimization. I should also add that each of these courses had a final project that I was able to add to my portfolio.
Now on to the topic of the article: after all the time I spent learning how to be a good Data Scientist, was it worth it? Or could I have done it all myself? More specifically, is it worth it for someone looking to pursue this as a career to invest 1-2 years and more than $100,000 in a Data Science degree?
I don't think I could have learned everything I learned in these 18 months of Data Science courses just by reading books, watching online videos, and studying the documentation of various software packages. What I have no doubt about is that earning a degree in Data Science can accelerate someone's career, as well as provide valuable experience with real-world projects that can be discussed during interviews and used in a portfolio. Personally, it would have taken me years to figure out how to optimize a 100-layer neural network running on a parallel cluster in Google Cloud if I had just been sitting at home watching videos on YouTube. I couldn't even have imagined how to do it.
Curiosity about Data Science is great, and I would like more people to be interested in this topic. Given the information explosion, it seems that in the next decade data will become the new world religion, and so it is inevitable that the world will need many more specialists in Data Science. However, curiosity can only take you so far, and having a piece of paper that shows you have spent time, invested in skills and good habits, and become a truly accomplished data scientist will set you apart from the rest. Data Science doesn't just exist as a Kaggle competition, as some seem to think.
My advice for those looking to do Data Science is to get a good foundation in statistics and mathematics. I also advise you to gain some programming experience in languages such as Python and R, as well as to get comfortable with development on Linux. Most of the computer science students I've seen seem to struggle with computer science-related aspects like working with Docker containers and creating and managing distributed clusters running on some cloud infrastructure. There are many complex skills to master to become an experienced Data Scientist, and I certainly cannot call myself an expert. However, with some experience behind me, I feel confident enough that I can continue to develop my own skills in Data Science and Machine Learning and apply them to industry-related projects and research, without fear of doing "bad science".
If you want to know what a Data Science course is, I recommend taking a look at the online courses offered by universities, which often earn you the credits you need to complete your degree. There is now a student at Harvard who completed 3 courses in Computer Science at the Extension School and now has a degree in Computing and Engineering and is one of the teaching assistants in the Advanced Data Science course. Everything is possible!
Online courses in Data Science with a state diploma from MISIS
NUST MISIS and SkillFactory (an online school on Data Science) have signed an agreement to create a joint online master's program “Data Science”, which will include internships in real projects, chat rooms with mentors, and an individual training plan. Classes will be taught by NUST MISIS professors and practitioners from Mail.ru Group, Yandex, Tinkoff and VTB banks, Lamoda, BIOCAD, AlfaStrakhovanie and others.
This is the first case in Russia of a partnership between a private educational company and a state university based on the OPM model (Online Program Management). The industrial partner of the program will be Mail.ru Group. The program is also supported by NVidia, Rostelecom and NTI University "20.35".
Graduates with a bachelor's degree in any field will be able to enroll in the master's program based on the results of an online exam. You can apply right now and until August 10.