Elegant tricks in a notebook on a personal laptop are nice and interesting. But as soon as the code has to run in a production environment, a whole set of constraints immediately appears:
- the amount of available hardware;
- performance requirements;
- stability;
- compliance with information security requirements;
- … (Add spices to taste).
In Russia today we are in a phase where the python language is positioned as a "silver bullet" for data science tasks. It seems this thesis was put forward by those who sell DS courses in python, and then the flywheel started spinning. In general, this is quite normal: almost all processes in the physical world are oscillatory.
Nevertheless, amid this hype some things go unmentioned. Even in basic DS tasks python has a number of annoying aspects that greatly complicate its use in a production environment.
Problem 1
The name of this problem is BlockManager. It is one of the pillars of the pandas architecture. Outwardly it shows up as follows:
- memory is consumed insatiably;
- the execution time of the code depends on the previous states of the interpreter and the sequence of operations and can vary by several orders of magnitude.
Some reading on the subject:
- 'The one pandas internal I teach all my new colleagues: the BlockManager';
- BlockManager in pandas: Wes McKinney, 'What is BlockManager and why does it exist?';
- Wes McKinney, 'Apache Arrow and the "10 Things I Hate About pandas"'.
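The dependence on the sequence of operations can be observed without touching pandas internals. The sketch below is only an illustration, not a benchmark: it builds the same frame twice, once column by column and once in a single call, and counts the `PerformanceWarning` messages that recent pandas versions emit when a frame becomes highly fragmented into many blocks. The helper names (`count_fragmentation_warnings`, `build_column_by_column`, `build_at_once`) and the sizes are made up for the example, and the exact point at which pandas starts warning is an implementation detail that may differ between versions.

```python
import warnings

import numpy as np
import pandas as pd


def count_fragmentation_warnings(build_frame):
    """Run build_frame() and count the pandas PerformanceWarning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, not just the first
        build_frame()
    return sum(
        issubclass(w.category, pd.errors.PerformanceWarning) for w in caught
    )


def build_column_by_column(n_rows=1_000, n_cols=150):
    # Each assignment inserts a new column, which pandas keeps as a
    # separate block until the frame is consolidated.
    df = pd.DataFrame(index=range(n_rows))
    for i in range(n_cols):
        df[f"c{i}"] = np.zeros(n_rows)
    return df


def build_at_once(n_rows=1_000, n_cols=150):
    # Building from a dict hands pandas all columns in one go.
    return pd.DataFrame({f"c{i}": np.zeros(n_rows) for i in range(n_cols)})


incremental_warnings = count_fragmentation_warnings(build_column_by_column)
at_once_warnings = count_fragmentation_warnings(build_at_once)
print("column by column:", incremental_warnings, "warning(s)")
print("built at once:   ", at_once_warnings, "warning(s)")
```

The two frames hold identical data, yet only the incremental path triggers the warning. The warning text itself suggests joining all columns at once with `pd.concat(axis=1)`, or calling `frame.copy()` to get a de-fragmented frame.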
Problem 2
Compare the pandas + sql/spark bundle with the data.table + Clickhouse bundle (… data.frame); see the Database-like ops benchmark.
Problem 3
Story-telling. Literate Programming. python … Rmarkdown.
It is clear that our trends are shaped by courses and by the requirements in job postings on hh.ru. But when it comes to solving practical problems in an enterprise, the R + Clickhouse bundle turns out to be much more profitable. golang, also a great tool, can be added to this set.
Fin, get your napalm out.
Previous publication - "R, Monte Carlo and Enterprise Problems, Part 2".