R-based solutions, for both classic reporting and operational analytics, have proven themselves very well in the enterprise environment. Undoubtedly, RStudio and its passionate team deserve much of the credit for this. With commercial RStudio products you do not have to think about infrastructure questions: you exchange some money for a ready-made solution "out of the box" and rely on their developers and support. With open-source editions, which account for the majority of installations in Russian companies, you have to handle the infrastructure questions yourself.
R solutions neatly fill the niche of "medium data": there is "a bit more" data than fits into Excel or an untuned relational system, and complex algorithms and processing are needed, but it is still too early to deploy the full "launch complex" of big data; the tasks are still within Earth orbit, so to speak. We are talking about tens to hundreds of terabytes in total, which easily fit into a ClickHouse backend. An important point: everything runs inside an internal network, in the overwhelming majority of cases completely cut off from the Internet.
Continuing the series of previous publications, this post refines the building blocks for a robust enterprise R application.
Problem statement
A production solution must ensure reproducible calculations and results. The reproducibility problem splits into several distinct directions. The large blocks are:
infrastructure reproducibility. Most of the questions here are closed by the combination of docker + renv + git.
software reproducibility. Most of the questions here are closed by packaging the code and covering it with autotests.
statistical "similarity" of the results. This is where the specifics of each individual task appear; some techniques that help to ensure it are suggested below.
What is the difficulty?
Algorithms "rolled out into production":
can be multiphase, with a cumulative calculation time of several hours;
can depend on manually prepared inputs (excel files and the like).

The requirements also differ between environments: what is tolerable in dev is not tolerable in prod. In the enterprise the bar is usually formulated in SLA terms: availability of X% at a cost of $Y.
Debugging data.frame transformations

Do not silence what the code tells you during execution; in particular, watch the streams of:

warning
message

Save intermediate results to .Rds files: a failed step can then be replayed in isolation. Keep the data volumes in mind (anywhere from 1 to 1000 GB in RAM may be involved), since dumping every step is not free. Useful discussions of these data-processing practices can be found in the Win-Vector blog.
Long pipe (%>%) chains are convenient to write but opaque to debug: an error or a "silent" data distortion surfaces only at the end of the chain, and the intermediate states are not directly observable.

A partial remedy:

tidylog. The package transparently wraps the tidyverse verbs (dplyr::mutate and friends) and prints, step by step, a brief summary of what each operation actually changed.
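A minimal sketch of a tidylog-instrumented pipeline (assuming the dplyr and tidylog packages are installed); each verb reports to the console what it did to the data:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidylog, warn.conflicts = FALSE)

res <- mtcars %>%
  filter(cyl > 4) %>%             # tidylog reports how many rows were removed
  mutate(hp_per_cyl = hp / cyl)   # and which column was created

print(nrow(res))
#> [1] 21
```

The logging costs almost nothing to add and can simply be switched off by not attaching tidylog, so the same pipeline code runs unchanged in production.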
Debugging techniques

Useful starting points on the topic:

"Debugging with RStudio" by Jonathan McPherson
"Advanced R", the "Debugging" chapter

What to do when the problem shows up at runtime (shiny applications, etc.)? The basic toolbox:

browser()
Interrupts execution at exactly that point and drops you into an interactive debugging console in the IDE, with full access to the local environment.
debug() / undebug() / debugonce()
Flag a function so that its next call is executed step by step in the debugger; debugonce() removes the flag automatically after a single call.

traceback()
Prints the call stack of the last uncaught error, showing where exactly the failure happened.
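A minimal sketch of how an error surfaces through a call chain. In an interactive session you would run traceback() right after the error to see that f called g; non-interactively we can only catch the condition:

```r
f <- function(x) g(x)
g <- function(x) stop("boom: x must be positive")

# Interactively:
# f(-1)          # Error in g(x) : boom: x must be positive
# traceback()    # 2: g(x)  /  1: f(-1)

# In a script, catch the condition and inspect its message:
msg <- tryCatch(f(-1), error = function(e) conditionMessage(e))
print(msg)
#> [1] "boom: x must be positive"
```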
options(datatable.verbose = TRUE)
Makes data.table print detailed diagnostics of its internals (which optimizations fired, how the query was executed, memory allocations).
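A small sketch (assuming the data.table package is installed): with the verbose option on, even a trivial grouped aggregation prints its internal execution plan to the console:

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = 1:3)

options(datatable.verbose = TRUE)
res <- dt[, .(s = sum(x)), by = g]  # prints grouping/GForce diagnostics
options(datatable.verbose = FALSE)

print(res)
```

Remember to switch the option back off: the diagnostics are noisy and would flood the logs of a long multiphase calculation.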
utils::getFromNamespace()
Retrieves a function from a package namespace, including unexported ones, which is handy when you need to inspect or debug package internals.
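A quick sketch: the same mechanism works for exported and unexported objects alike (shown here on the exported utils::head to keep the example package-independent; the commented names are hypothetical):

```r
h <- utils::getFromNamespace("head", "utils")
print(identical(h, utils::head))
#> [1] TRUE

# For unexported internals the pattern is identical, e.g.
# utils::getFromNamespace("some_internal_fn", "somepkg")  # hypothetical names
```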
Helper tools

pryr::object_size()
Reports how much memory an object actually occupies, with shared components counted correctly.
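A quick sketch with the base-R analogue utils::object.size() (pryr::object_size() handles shared structures more accurately, but the base function needs no extra packages):

```r
x <- numeric(1e6)   # one million doubles, 8 bytes each
sz <- object.size(x)
print(sz, units = "Mb")
# roughly 8 MB: 1e6 * 8 bytes of payload plus a small object header
```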
reprex
Builds a minimal reproducible example together with its output, ready to be shared with colleagues or in a bug report.

gginnards
Helps to explore and manipulate the internal structure of ggplot objects.
browser() also works inside data.table expressions, which makes it possible to stop right in the middle of a by-group computation:
library(data.table)
library(magrittr)
dt <- as.data.table(mtcars) %>%
.[, {m <- head(.SD, 2); print(ls()); browser(); m}, by = gear]
#> [1] "-.POSIXt" "am" "carb" "Cfastmean" "cyl" "disp"
#> [7] "drat" "gear" "hp" "m" "mpg" "print"
#> [13] "qsec" "strptime" "vs" "wt"
#> Called from: `[.data.table`(., , {
#> m <- head(.SD, 2)
#> print(ls())
#> browser()
#> m
#> }, by = gear)
Inside the stopped session you can inspect (and, if needed, modify) every variable visible in the j-expression environment before letting the calculation continue.
Measuring execution time

The simplest tool is system.time({…}) wrapped around a block of code: it reports the user, system and elapsed time for that block. For anything more fine-grained, profile the code instead of guessing which part is slow.
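A minimal sketch of timing a block with system.time(); the Sys.sleep() call stands in for a real computation so the elapsed time has a known lower bound:

```r
timing <- system.time({
  Sys.sleep(0.2)                        # stand-in for a real computation
  s <- sum(as.numeric(seq_len(1e6)))    # plus a little actual work
})
elapsed <- timing["elapsed"]

print(elapsed)   # at least 0.2 seconds
```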
Conclusion
In practical tasks all sorts of surprises can happen, and the magic of big data does not always help; read the ironic detective story "Using AWK and R to parse 25tb".
Previous publication: "How to tame process mining with R in an enterprise?"