Data Scientist's Notes: Small Tools Are Big



Most often, in a data scientist's work, I have to move data from one form to another: aggregate it, bring it to the same granularity, clean it, load it, unload it, analyze it, format it, and send off the results (which, generally speaking, are also data in some form). There is always something wrong with the data, and it needs to be shuffled back and forth quickly; classic Unix utilities and small but proud tools help me with this more than anything else, and they are what we will talk about today.



So today there will be a selection of tools, with examples and the situations in which I have to use them. Everything described here is real, subjective experience; of course it differs for everyone, but perhaps it will be useful to someone.



Tools: learn your tools. Everything written here is subjective and based solely on personal experience: it helped me, and maybe it will help you too.





Zsh and oh-my-zsh - after all these years in Bash!



I remember it like it was yesterday: I was 17 and had just installed Linux. Terminal and Bash. Bash was always part of the process and, for me, personified actual work in the terminal. Twelve years later, after grad school, I ended up in a company that had an onboarding document; for the first time I found myself on a Mac, and I decided to follow it.



And lo and behold! Convenient folder navigation, sane autocompletion, a git indicator, themes, plugins, support for Python virtual environments, and so on: now I sit in the terminal and couldn't be happier!





Install zsh the way you usually install everything, and then move on to oh-my-zsh (essentially a community-maintained collection of recipes that work out of the box, with added support for plugins, themes, and so on). You can get it here. You can also install a theme (for example, this one). Here is a good demo of the possibilities, taken from this article.
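For reference, a typical setup might look roughly like this (the apt line assumes a Debian/Ubuntu machine; the installer script is the one from the oh-my-zsh repository):

# install zsh itself (assuming a Debian/Ubuntu system)
sudo apt install zsh

# install oh-my-zsh via its bundled installer script
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

# make zsh the default shell
chsh -s "$(which zsh)"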



Pipelines





One of the finest features of the terminal is the pipeline. To put it simply, it lets you connect the output of one command to the input of another. Here is a simple example, taken literally from a task I was doing two days ago.



I needed to model a problem in a language for solving combinatorial problems. Everything was run from the terminal, and the output was text in an absolutely unreadable form; by putting in the simple | symbol, I connected the solver's output to the input of a formatting script and got readable results:



python solve.py | python format.py



A more interesting, everyday task is evaluating some parameter or characteristic of uploaded data. As a rule, this is a series of quick checks that the required values on the server with the data behave well: for example, we want to see how the parser is doing and check how many unique groups have been collected across all the json files (this number should naturally grow over time):



cat data/*groups* | jq .group | uniq | wc -l


We'll talk more about each of them, but the general idea is already clear:



  • cat (short for concatenate) prints the contents of the files from the data/ folder whose names contain "groups"
  • jq extracts the "group" field from the json
  • uniq leaves only unique groups (note that uniq only collapses adjacent duplicates; a stricter variant is shown right after this list)
  • wc with the -l flag counts the lines, i.e. the number of groups
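And the stricter variant of the same count, with the values sorted first so that uniq really collapses all duplicates (jq's -r flag just strips the surrounding quotes):

cat data/*groups* | jq -r .group | sort | uniq | wc -l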


And now we'll take a closer look at wc.



WC - a small but proud utility for counting lines, words, and more



wc can quickly count words, lines, characters, bytes, and the maximum line length, all with simple flags:



  • --bytes
  • --chars
  • --words
  • --lines
  • --max-line-length

It seems trivial, but it turns out to be needed incredibly often, and it is convenient.



Everyday use: let's quickly estimate how much data we have collected (here one line is one record):
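A minimal sketch of such a check (the file names here are made up for illustration):

# one line is one record, so the line count is the number of records
wc -l data/groups.jsonl

# or over everything collected so far, with a grand total at the end
wc -l data/*groups*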





More details here.



Ack / grep



Thousands of manuals and articles have been written about them, but I can't help mentioning them: they search through text using regular expressions and their own pattern query language. In general, ack seems friendlier to me and easier to use out of the box, so that is what will be shown here:



Example: quickly find occurrences of the word ga2m (a model type) as a whole word (the -w flag), case-insensitively (the -i flag), in Python source files:
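A sketch of such a query, assuming ack's built-in --python file-type filter:

ack -w -i --python ga2m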





JQ - parse json on the command line



Documentation.



jq is downright grep/ack for json (albeit with a touch of sed and awk; more on the latter later): essentially a simple processor for json and json lines on the command line, and sometimes it is extremely convenient. At one point I had to parse the wikidata archive in bz2 format: it weighs about 100GB compressed and roughly 0.5TB uncompressed.



I needed to extract a mapping between several fields from it, and this turned out to be very simple to do on a single machine with practically no load on the CPU or memory. Here is the exact command I used:



bzcat data/latest-all.json.bz2 | jq --stream 'select((.[0][1] == "sitelinks" and (.[0][2]=="enwiki" or .[0][2] =="ruwiki") and .[0][3] =="title") or .[0][1] == "id")' | python3 scripts/post_process.py "output.csv"


This was essentially the entire pipeline that created the required mapping; as we can see, everything worked in streaming mode:



  • bzcat reads a piece of the archive and feeds it to jq
  • jq with the --stream flag immediately produces results and passes them on to the post-processor in python (just as in the very first example)
  • internally, the post-processor is a simple state machine (a sketch of what it might look like is shown below)
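scripts/post_process.py itself is not shown here, so the following is only a hypothetical sketch of such a state machine; it assumes jq is run with the -c flag (so that each stream event, a [path, value] pair, arrives as one line) and that the desired output is rows of id, enwiki title, ruwiki title:

# hypothetical sketch of a post-processor: a state machine over jq --stream events
import csv
import json
import sys


def flush(writer, state):
    # one row per wikidata entity: id, enwiki title, ruwiki title
    if state:
        writer.writerow([state.get("id", ""),
                         state.get("enwiki", ""),
                         state.get("ruwiki", "")])


def main(out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        current_idx, state = None, {}
        for line in sys.stdin:
            event = json.loads(line)
            if len(event) != 2:            # closing events like [path] carry no value
                continue
            path, value = event
            if path[0] != current_idx:     # a new top-level index means the next entity begins
                flush(writer, state)
                current_idx, state = path[0], {}
            if path[1:] == ["id"]:
                state["id"] = value
            elif len(path) == 4 and path[1] == "sitelinks" and path[3] == "title":
                state[path[2]] = value     # path[2] is "enwiki" or "ruwiki"
        flush(writer, state)               # do not forget the last entity


if __name__ == "__main__":
    main(sys.argv[1])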


The bottom line: a pipeline working in streaming mode over a fairly large dataset (0.5TB), without significant resources, built from a simple pipe and a couple of tools. Definitely worth checking out at your leisure.



fzf - fuzzy finder



The most convenient thing (especially inside vim): it searches through files quickly, which is handy in a large project, and especially when you have several of them. As a rule, you need to quickly find files by a certain word in a large project. I was getting into a new project that includes several large repositories, and as an introductory task I needed to add one simple model to the assortment available in the system: I had to quickly find the relevant files by the keyword ga2m and work by analogy with other "code blocks", quickly editing one or the other, and here fzf comes to the rescue very well:
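One way this might look on the command line, combining fzf with ack from above:

# list the files mentioning ga2m, pick one interactively with fzf, open it in vim
vim "$(ack -l -i ga2m | fzf)"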





Link to the repository.



AWK



The name comes from the initials of its creators: Aho, Weinberger, and Kernighan. It is essentially a scripting language for processing tabular text data: it applies transformation patterns to each line of a file.



As a rule, it is ideal for quick one-off transformations. For example, we had a dataset assembled by hand as a tsv, while the processor expected jsonl as input with an additional "theme" field that was not in the source file (needed for some things that were not critical for the current calculations). In total, a simple one-liner was written:



cat groups.tsv | awk '{ printf "{\"group\": \"%s\", \"theme\": \"manual\" }\n", $1  }' > group.jsonl




In effect, it took the file and wrapped each line in json with the necessary fields.
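For illustration, feeding a single made-up tsv line through the same awk program:

printf 'cool_group\t42\n' | awk '{ printf "{\"group\": \"%s\", \"theme\": \"manual\" }\n", $1 }'
# prints: {"group": "cool_group", "theme": "manual" }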



Link to a tutorial.



wget - versatile command line downloader



Scripts and pipelines regularly have to pull something down from somewhere, and wget does not fail: it can download, authenticate, work through proxies, handle cookies, and besides http(s) it also speaks ftp.
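For example, pulling down the wikidata dump mentioned above might look roughly like this (-c resumes an interrupted download, -O names the output file; the URL is the standard location of the latest entity dump):

wget -c -O data/latest-all.json.bz2 https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2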



A Swiss Army knife for downloading.



HSTR - search command history, with a human face



Command history: hstr



I regularly have to search for something in my command history:



  • "I already had to do it"
  • "What keys does X start with?"
  • "But this piece can be reused"


So good, convenient search through command history is quite critical for me, and hstr does this job perfectly:
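Setup is usually a one-liner; a sketch assuming zsh and that hstr's suggested configuration (which rebinds Ctrl-R) suits you:

# append hstr's recommended settings, including the Ctrl-R binding, to the zsh config
hstr --show-configuration >> ~/.zshrc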







Useful but not included



Finally, I will mention things that are useful but each deserve a separate article; they are worth looking into:



  • Ssh + tmux + vim (neovim, plugins, etc.)
  • Basic knowledge of command-line git + git hooks
  • Building data pipelines with make/just
  • Python virtual environments
  • Docker




