Parallelizing code in R in a couple of minutes

If stereotypes are to be believed, R is a language highly specialized for statistics and machine learning. A second stereotype is that pure R code is not very fast: first, because it is interpreted, and second, because it executes sequentially. Stereotypes have some connection with reality, otherwise they would not exist, but by nature they give an extremely simplified picture of the world in which many details are lost. In particular, today I want to share a surprisingly simple way to add parallelism to R and dramatically speed up existing code without making any major changes to it. All of it takes just a couple of minutes.



Let's say we have a matrix or data table with some number of rows and columns, and we want to perform the same kind of computation for each row: for example, compute the sum of the squares of its values. It is natural to move the computation into a function and call it for each row.



Initial data:



a <- matrix(rnorm(5000000, mean=0, sd=2), 100000, 50)  # 100,000 rows x 50 columns, one draw per cell


Function:



# Sum of squares of a vector; the inner sapply() squares each element one by one
sum.of.squares <- function(n) {
  n_sq <- sapply(n, function(x) x^2)
  sum(n_sq)
}
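As a quick sanity check (my addition, not from the article), the function should agree with the direct vectorized form sum(x^2), which R computes element-wise:

```r
# sum.of.squares() from above, reproduced so the check is self-contained
sum.of.squares <- function(n) {
  n_sq <- sapply(n, function(x) x^2)
  sum(n_sq)
}

x <- c(1, 2, 3)
sum.of.squares(x)                                # 1 + 4 + 9 = 14
stopifnot(all.equal(sum.of.squares(x), sum(x^2)))
```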


You could simply loop over the rows and apply this function to each of them, but that is not the recommended style in R. The computation for each row runs sequentially, all on a single core, so code like this is really not very efficient. Still, for reference, let's write this variant down and measure its execution time:



b <- vector()
for(i in 1:dim(a)[1]) {
  b[i] <- sum.of.squares(a[i,])
}


We measure the execution time:



b <- vector()
start_time <- Sys.time()
for(i in 1:dim(a)[1]) {
  b[i] <- sum.of.squares(a[i,])
}
timediff <- difftime(Sys.time(), start_time)
cat("Execution time: ", timediff, units(timediff))


We get:



Execution time:  4.474074 secs


We will use this time as a baseline for comparison with the other methods.



A more idiomatic approach in R is the apply() function. It takes the data, the margin along which to apply the function (1 for rows, 2 for columns), and the function itself. A close relative is sapply(), which applies a function over a vector or list, which is exactly what we already do inside sum.of.squares(). The loop disappears from the code, although it still runs under the hood. With apply(), our computation looks like this:



b <- apply(a, 1, function(x) sum.of.squares(x))


The code has become shorter and more readable. Let's measure the execution time in the same way:



start_time <- Sys.time()
b <- apply(a, 1, function(x) sum.of.squares(x))
timediff <- difftime(Sys.time(),start_time)
cat("Execution time: ", timediff, units(timediff))


We get:



Execution time: 4.484046 secs


As you can see, the time is practically the same. The conclusion: apply() merely hides the loop — under the hood the computation is still sequential and still runs on a single core.
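As an aside worth noting (not the article's method): for this particular computation there is also a fully vectorized alternative, rowSums(), which avoids both the loop and apply() entirely and is usually the fastest option whenever such a form exists. A sketch:

```r
# Vectorized row-wise sum of squares: square the whole matrix at once,
# then let rowSums() collapse each row in compiled code.
a <- matrix(rnorm(5000000, mean = 0, sd = 2), 100000, 50)
b <- rowSums(a^2)
```

The parallelization described below matters precisely for computations that have no such ready-made vectorized form.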



Fortunately, there is a package that lets R parallelize exactly this kind of code with almost no changes to it. The idea is simple: for each function of the apply() family it provides a parallel counterpart with the same signature, so switching amounts to renaming the function being called. The apply() family here means apply() itself plus by(), eapply(), lapply(), Map(), .mapply(), mapply(), replicate(), sapply(), tapply() and vapply(). The package is called future.apply, and to get future_apply() we install it:



install.packages("future.apply") 


Installation is standard and takes a minute. After that, load the package and switch on parallel execution:



library("future.apply")
plan(multiprocess)  # deprecated in recent versions of future; use plan(multisession) instead
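How many workers the chosen plan actually gives you can be checked directly; a small sketch, assuming the future package's documented helpers nbrOfWorkers() and availableCores():

```r
library(future)
plan(multisession, workers = 2)  # multisession works on all platforms, including Windows

cat("workers in use:", nbrOfWorkers(), "\n")
cat("logical cores available:", availableCores(), "\n")
```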


And that is essentially all the setup. plan() tells the package how to distribute the computations; other strategies are described in the documentation for future::plan(). From here on, to parallelize a computation it is enough to prepend "future_" to the name of the apply-family function. Our call becomes:



b <- future_apply(a, 1, function(x) sum.of.squares(x))
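Parallelizing must not change the answer; a quick sketch (assuming future.apply is installed) checking on a smaller matrix that the parallel result matches the sequential one:

```r
library(future.apply)
plan(multisession, workers = 2)

sum.of.squares <- function(n) {
  n_sq <- sapply(n, function(x) x^2)
  sum(n_sq)
}

a_small <- matrix(rnorm(5000), 100, 50)
b_seq <- apply(a_small, 1, sum.of.squares)         # sequential
b_par <- future_apply(a_small, 1, sum.of.squares)  # parallel
stopifnot(all.equal(b_seq, b_par))
```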


Measuring the execution time again:



start_time <- Sys.time()
b <- future_apply(a, 1, function(x) sum.of.squares(x))
timediff <- difftime(Sys.time(),start_time)
cat("Execution time: ", timediff, units(timediff))


We get:



Execution time:  1.283569 secs


For reference, the measurements were made on a laptop with an Intel Core i7-8750H, which exposes 12 logical cores. Even on this 12-thread machine the speedup is about 3.5x rather than 12x: part of the time goes into launching the worker processes and shipping the data to them.



A few caveats in conclusion. Parallelism is not free: launching the workers and copying the data to them takes time, so on small or very fast tasks the parallel version may even lose to the sequential one. Nested parallelism does not help either: additionally replacing the inner sapply() in sum.of.squares() with future_sapply() brings no further speedup. The structure of the data matters as well: if the matrix is converted to a data frame (a <- data.frame(a)), the same row-wise computation takes about 8 seconds, because extracting a row from a data frame is much more expensive than from a matrix. In short, always measure on your own data.
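The matrix-versus-data-frame effect is easy to verify for yourself; a sketch using base R's system.time() (the absolute numbers will of course differ between machines):

```r
a <- matrix(rnorm(5000000, mean = 0, sd = 2), 100000, 50)
d <- data.frame(a)

# Extracting a row from a matrix is a cheap contiguous copy;
# extracting one from a data frame touches all 50 list columns.
t_mat <- system.time(for (i in 1:2000) invisible(a[i, ]))["elapsed"]
t_df  <- system.time(for (i in 1:2000) invisible(d[i, ]))["elapsed"]
cat("matrix:", t_mat, "s; data frame:", t_df, "s\n")
```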



Well, that's all. The method is quite simple; when I first learned about it, it was a real godsend for me. So is it true that present-day R does not support parallel computing? It depends on how strictly you pose the question, but in a practical sense we can safely assume that it does.



