Today there are a hundred and one courses in Data Science, and it has long been known that the most money in Data Science is made precisely by selling Data Science courses (why dig for gold when you can sell shovels?). The main disadvantage of these courses is that they have nothing to do with real work: no one will hand you clean, preprocessed data in the required format. And once you leave the courses and start solving a real problem, many nuances emerge.
Therefore, we are starting a series of notes, "What can go wrong with Data Science", based on real events that happened to me, my comrades and colleagues. We will analyze typical Data Science tasks through real examples of how things actually play out. Today we start with the task of collecting data.
And the first thing people stumble over when they start working with real data is actually collecting the data that is relevant to them. The key message of this article:
We systematically underestimate the time, resources, and effort involved in collecting, cleaning, and preparing data.
And most importantly, we will discuss what to do to prevent this.
According to various estimates, cleaning, transformation, data processing, feature engineering, etc. take 80-90% of the time, and analysis 10-20%, while almost all educational material focuses exclusively on analysis.
Let's take a simple analytical problem in three variants as a typical example and see what kind of "aggravating circumstances" come with each.
Again, as an example, we will consider similar variations of the task of collecting data and comparing communities for:
- Two subreddits of Reddit
- Two Habr sections
- Two groups of Odnoklassniki
Conditional approach in theory
Open the site, read the examples; if everything looks clear, budget a few hours for reading, a few hours for writing code from the examples and debugging it, add a few hours for the collection itself, and throw in a few hours in reserve (multiply by two and add N hours).
Key point: The time estimate is based on assumptions and guesses about how long it will take.
It is necessary to start the time analysis by evaluating the following parameters for the conditional problem described above:
- What is the size of the data and how much of it physically needs to be collected (see below).
- How long it takes to collect one record and how long it takes to collect the next one.
- Budget for writing code that saves its state and restarts the collection when (and not if) everything crashes.
- Whether we need authorization, and how long it will take to get access to the API.
- How complex and heterogeneous the data structure is: how many transformations are needed and what exactly has to be extracted.
- Whether documentation, code samples and a community are available.
- Whether the functions we need actually exist and, if not, how and at what cost a workaround can be built.
Most importantly, to estimate time you actually need to invest time and effort in "reconnaissance in force": only then will your planning be adequate. Therefore, no matter how hard you are pushed to say "how long does it take to collect the data", take time for a preliminary analysis and argue how much the estimate will vary depending on the real parameters of the problem.
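As a rough illustration, such "reconnaissance in force" can start with something as simple as timing the collection of a small sample and extrapolating, with a safety factor on top. Everything in the sketch below (the collect_one_record function, the sample, the factor of two) is a hypothetical placeholder rather than a recipe:

import time

def estimate_collection_hours(collect_one_record, sample_ids, total_records, safety_factor=2.0):
    # time a small, representative sample of real requests
    start = time.monotonic()
    for record_id in sample_ids:
        collect_one_record(record_id)
    per_record = (time.monotonic() - start) / len(sample_ids)
    # extrapolate to the full volume and apply the "multiply by two" reserve
    return per_record * total_records * safety_factor / 3600

# hypothetical usage: 100 sample records, a million records in total
# print(estimate_collection_hours(collect_one_record, sample_ids, 1_000_000), "hours")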
Now let's walk through specific examples where exactly these parameters change.
Key point: The assessment is based on an analysis of the key factors that influence the volume and complexity of the work.
Guesstimation is a good approach when the functional elements are small enough and there are not many factors that can significantly affect the structure of the problem. But in a number of Data Science tasks such factors become extremely numerous, and this approach becomes inadequate.
Comparison of Reddit Communities
Let's start with the simplest case (as it turns out later). To be completely honest, this is an almost ideal case; let's check it against our difficulty checklist:
- There is a neat, straightforward and documented API.
- Crucially, obtaining an access token is simple and automatic.
- There is a Python wrapper with a bunch of examples.
- There is a community that analyzes and collects data from Reddit, up to and including YouTube videos explaining how to use the Python wrapper.
- The methods we need most likely already exist in the API. Moreover, the code looks compact and clean; below is an example of a function that collects the comments on a post.
import logging
from praw import Reddit    # assuming the wrapper in question is PRAW

logger = logging.getLogger(__name__)
AGENT = 'my-research-bot/0.1'    # Reddit requires a descriptive user agent; the value here is a placeholder

def get_comments(submission_id):
    reddit = Reddit(check_for_updates=False, user_agent=AGENT)
    submission = reddit.submission(id=submission_id)
    # replace_more() expands "load more comments" stubs and returns the ones left unexpanded
    more_comments = submission.comments.replace_more()
    if more_comments:
        skipped_comments = sum(x.count for x in more_comments)
        logger.debug('Skipped %d MoreComments (%d comments)',
                     len(more_comments), skipped_comments)
    return submission.comments.list()
Taken from this collection of handy wrapper utilities.
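To connect this to the actual task of comparing two subreddits, here is a rough usage sketch. It assumes the wrapper is PRAW; the credentials, the subreddit names and the chosen statistics are placeholders, so treat it as an illustration rather than a ready recipe:

from collections import Counter
from praw import Reddit

reddit = Reddit(client_id='...', client_secret='...',
                user_agent='my-research-bot/0.1', check_for_updates=False)

def collect_subreddit_stats(name, limit=100):
    # simple aggregate statistics over the top posts of one community
    stats = Counter()
    for submission in reddit.subreddit(name).top(limit=limit):
        stats['posts'] += 1
        stats['score'] += submission.score
        stats['comments'] += submission.num_comments
    return stats

# hypothetical comparison of two communities
# print(collect_subreddit_stats('datascience'), collect_subreddit_stats('MachineLearning'))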
Despite the fact that we have the best case here, it is still worth considering a number of important factors from real life:
- API limits - we are forced to take data in batches (sleep between requests, etc.).
- Collection time - for a complete analysis and comparison, you will have to set aside significant time just for the spider to walk through the subreddit.
- The bot must run on a server: you can't just run it on your laptop, put it in your backpack and go about your business. So I ran everything on a VPS. With the habrahabr10 promo code, you can save another 10% of the cost.
- Physical inaccessibility of some data (it is visible only to admins or is too difficult to collect): this must be taken into account, since not all data can, in principle, be collected in a reasonable time.
- Network errors: Networking is a pain.
- This is living real data - it is never clean.
Of course, these nuances must be built into the development plan. The specific hours or days depend on development experience or experience with similar tasks; nevertheless, we see that here the task is purely an engineering one and does not require any extra contortions: everything can be estimated, planned out and done.
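As an illustration of how those nuances (rate limits, sleeps between requests, crashes and restarts) can be baked into the collector, here is a minimal sketch; the checkpoint file, the pause length and the fetch_one function are hypothetical placeholders:

import json
import time
from pathlib import Path

STATE_FILE = Path('collected_ids.json')     # hypothetical checkpoint file

def load_state():
    return set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()

def save_state(done):
    STATE_FILE.write_text(json.dumps(sorted(done)))

def collect_all(ids, fetch_one, pause=1.0):
    # fetch every id exactly once, sleeping between requests and surviving restarts
    done = load_state()
    for item_id in ids:
        if item_id in done:                 # already collected before a previous crash
            continue
        try:
            fetch_one(item_id)              # the network call that may fail at any moment
        except Exception:
            save_state(done)                # checkpoint, then let the error surface
            raise
        done.add(item_id)
        time.sleep(pause)                   # stay within the API limits
    save_state(done)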
Comparison of Habr sections
Let's move on to a more interesting and non-trivial case: comparing streams and/or sections of Habr.
Let's check our difficulty checklist. Here, to understand each point, you already have to poke into the problem itself a little and experiment.
- At first you think there is an API, but there isn't. Yes, yes, Habr has an API, but it is not available to users (or maybe it does not work at all).
- Then you just start parsing the html - "import requests", what could go wrong?
- How do you parse? The simplest and most commonly used approach is to iterate over IDs; note that it is not the most efficient and you will have to handle various cases, for example the density of real IDs among all existing ones (a sketch of this approach follows after this list).
Taken from this article.
Then it turns out that the HTML itself is heterogeneous: not every post has the same structure, and the parser has to cope with that. Here, for example, is what can happen when you pull the score out of the html:
1) int(score) can throw: for instance, the score may be rendered as "–5", where the dash is a typographic character rather than a regular minus sign (who would have thought?), so the string has to be normalized before conversion:
try:
    # normalize the typographic dash (and plus) that sometimes appears instead of a plain sign
    score_txt = post.find(class_="score").text.replace(u"–", "-").replace(u"+", "+")
    score = int(score_txt)
    if check_date(date):
        post_score += score
except ValueError:
    pass    # the score text still was not a number
The date, by the way, also has to be checked separately, hence the check_date call.
2) Other fields turn out to be just as heterogeneous and need their own normalization.
3) Some posts are simply missing the elements the parser expects.
4) Error handling. In fact, you will have to handle errors and everything that may or may not happen; you cannot predict for sure what will go wrong, how else the structure may vary, or what will fall off where. You will simply have to try, and account for the errors the parser throws one by one.
- Then you realize that you need to parse in several threads, otherwise parsing in a single thread will take 30+ hours (this is purely the execution time of an already working single-threaded parser that sleeps between requests and does not fall under any bans). In this article, at some point that is exactly the pattern things followed.
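For reference, here is a minimal sketch of that ID-iteration approach, assuming plain requests plus BeautifulSoup; the URL template, the class name and the ID range are hypothetical, and a real parser would add the sleeps, state saving and error accounting discussed above:

import time
import requests
from bs4 import BeautifulSoup

def parse_post(post_id):
    # fetch one post by id; return None for holes in the id space or unparsable pages
    resp = requests.get(f'https://habr.com/ru/post/{post_id}/', timeout=30)
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    score_el = soup.find(class_='score')    # hypothetical class name
    try:
        score = int(score_el.text.replace('–', '-')) if score_el else None
    except ValueError:
        score = None                        # yet another unexpected format
    return {'id': post_id, 'score': score}

posts = []
for post_id in range(500000, 500100):       # hypothetical id range
    post = parse_post(post_id)
    if post is not None:
        posts.append(post)
    time.sleep(1)                           # be polite and avoid bans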
Total difficulty checklist:
- Working with web and html parsing with iteration and search by ID.
- Documents of heterogeneous structure.
- There are many places where the code can easily break.
- It is necessary to parallelize the code (see the sketch after this checklist).
- Missing documentation, code samples, and / or community.
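And a minimal sketch of what parallelizing that iteration might look like, reusing the hypothetical parse_post from the sketch above; the number of workers is an arbitrary figure and would have to respect whatever limits the site enforces:

from concurrent.futures import ThreadPoolExecutor

def parse_range_parallel(first_id, last_id, workers=8):
    # run parse_post over an id range in several threads and drop the holes
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(parse_post, range(first_id, last_id)))
    return [post for post in results if post is not None]

# posts = parse_range_parallel(500000, 510000)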
The conditional time estimate for this task will be 3-5 times higher than for collecting data from Reddit.
Comparison of Odnoklassniki groups
Let's move on to the most technically interesting case described here. For me, it was interesting precisely because at first glance it looks quite trivial, but it turns out not to be, as soon as you poke it with a stick.
Let's start with our difficulty checklist and note that many of them will turn out to be much more difficult than they look at first:
- There is an API, but it almost completely lacks the necessary functions.
- You need to request access to certain functions by mail, that is, access is not granted instantly.
- The documentation is sparse, and in places you simply have to guess how things are supposed to work.
- A good part of the data we need cannot be obtained through the API at all and has to be scraped from the site itself.
- There is a Python wrapper, but it covers only part of what we need.
- So Selenium has to be brought in as well, and the collection turns into a hybrid of browser automation and API calls.
1) …
2) The pages have to be walked with Selenium (logging into ok.ru and navigating it in a real browser).
3) Much of the content is rendered by JavaScript and only appears once the page is loaded in a browser.
4) …
5) For the parts that the API and the wrapper do cover, the collection code ends up looking roughly like this (note that it makes sense to keep track of already processed discussions):
import time

import odnoklassniki.api
from tqdm import tqdm

def get_comments(args, context, discussions):
    # get_comments_from_discussion_via_api and bp (a debugger hook) are defined elsewhere
    pause = 1
    if args.extract_comments:
        all_comments = set()
        # makes sense to keep track of already processed discussions
        for discussion in tqdm(discussions):
            try:
                comments = get_comments_from_discussion_via_api(context, discussion)
            except odnoklassniki.api.OdnoklassnikiError as e:
                if "NOT_FOUND" in str(e):
                    comments = set()
                else:
                    print(e)
                    bp()
            all_comments |= comments
            time.sleep(pause)
        return all_comments
And from time to time the API answers with errors like this one:
OdnoklassnikiError("Error(code: 'None', description: 'HTTP error', method: 'discussions.getComments', params: …)")
6) In the end, the working solution is a combination of Selenium and the API. You have to save the state and restart the collection, and handle a great many errors, including inconsistent behavior of the site itself; these are errors that are quite hard to imagine in advance (unless you write parsers professionally, of course).
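To make the Selenium half of that combination concrete, here is a minimal sketch of loading a JavaScript-rendered page and scrolling until no new content appears; the URL, the waits and the scroll limit are hypothetical, and a real collector would add the login step, the state saving and the error handling described above:

import time
from selenium import webdriver

def load_group_page(url, max_scrolls=50, pause=2):
    # open a JavaScript-rendered page and scroll until nothing new loads
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")
        for _ in range(max_scrolls):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)               # give the JavaScript time to fetch the next batch
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:   # nothing new appeared, we are done
                break
            last_height = new_height
        return driver.page_source           # raw HTML to feed into the parser
    finally:
        driver.quit()

# html = load_group_page('https://ok.ru/group/.../topics')    # hypothetical URL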
The conditional time estimate for this task will be 3-5 times higher than for collecting data from Habr, even though in the case of Habr we use a head-on approach with HTML parsing, while in the case of OK we can work with the API in the critical places.
Conclusions
No matter how hard you are pressed to estimate the deadlines "on the spot" (we have planning today!) for a sizeable data processing pipeline module, its execution time is almost never possible to estimate even qualitatively without analyzing the parameters of the task.
Speaking more philosophically, agile estimation strategies are well suited to engineering problems; but for problems that are more experimental and, in a sense, "creative" and exploratory, that is, less predictable, difficulties arise, as in the examples of similar topics analyzed here.
Of course, data collection is just a prime example: usually the task seems incredibly simple and technically uncomplicated, and it is in the details that the devil most often lurks. And it is precisely on this task that one can show the whole range of possible options for what can go wrong and how long the work can actually take.
If you glance at the characteristics of the problem out of the corner of your eye, without additional experimentation, then Reddit and OK look similar: there is an API, there is a Python wrapper, but in reality the difference is huge. Judging by the same parameters, parsing Habr looks more complicated than OK, while in practice it is quite the opposite, and that is exactly what can be discovered by running simple experiments to analyze the parameters of the problem.
In my experience, the most effective approach is to give a rough estimate of the time you will need for the preliminary analysis itself: simple first experiments and reading the documentation. Those will then allow you to give an accurate estimate for the entire work. In terms of the popular agile methodology, I ask for a ticket to be created for "estimating the parameters of the problem", on the basis of which I can assess what can be accomplished within a "sprint" and give a more accurate estimate for each task.
Therefore, the most effective argument seems to be one that would show the "non-technical" specialist how much time and resources will vary depending on parameters that have yet to be estimated.