A 14,000x Speedup, or a Victory for Computer Science

As a scientific software developer, I do a lot of programming. Most people in other scientific fields tend to assume that programming is "just" banging out some code and running it. I have good working relationships with many colleagues, including those from other countries and other disciplines: physics, climatology, biology, and so on. But when it comes to software development, you get the distinct impression that they are thinking: "Hey, how hard can it be?! We just write down a few instructions about what the computer should do, press the Run button, and we're done - we get the answer!"



The problem is that it is incredibly easy to write instructions that don't mean what you think they mean. For example, a program may be completely uninterpretable to the computer. Beyond that, there is literally no way to tell whether a program will even terminate without running it. And there are many, many ways to make a program run far slower than it has to. I mean... really slow. So slow that finishing would take your whole lifetime or more. This happens most often with programs written by people without a computer science education, that is, by scientists from other fields. My job is to fix such programs.



What people don't understand is that computer science teaches you the theory of computation, the complexity of algorithms, and computability (that is, can we actually compute something? Too often we take it for granted that we can!), as well as how to write code that runs in a minimal amount of time or with minimal use of resources.



Let me show you an example of a huge optimization in one simple script written by a colleague of mine.



We do a lot of downscaling in climatology. We take temperature and precipitation readings from a large-scale global climate model and relate them to a fine-scale local grid. Let's say the global grid is 50x25 and the local grid is 1000x500. For each cell in the local grid, we want to know which cell in the global grid it corresponds to.



A simple way to state the problem is: for each local cell L[i], find the global cell G[j] that minimizes the distance between them. The search looks like this:



    for each cell i in L:
      for each cell j in G:
        compute the distance between L[i] and G[j]
      record the j that minimizes the distance to L[i]
    


Seems simple enough. However, if you look closely, we do a lot of extra work. Look at the algorithm in terms of the size of the input.



    for each cell i in L:                         # L times
      for each cell j in G:                       # L x G times
        compute the distance from L[i] to G[j]    # L x G times
      find the minimum of those distances         # G times for each i in L (L x G)
      record the corresponding index              # G times for each i in L (L x G)


The code looks something like this:



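## Read the fine-scale (observational) grid longitudes and latitudes from the open NetCDF file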
obs.lon <- ncvar_get(nc.obs, 'lon')
obs.lat <- ncvar_get(nc.obs, 'lat')
n.lon <- length(obs.lon)
n.lat <- length(obs.lat)

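## Expand the lat/lon vectors into full 2-D coordinate matrices (one value per fine-scale cell)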
obs.lats <- matrix(obs.lat, nrow=n.lon, ncol=n.lat, byrow=TRUE)
obs.lons <- matrix(obs.lon, nrow=n.lon, ncol=n.lat)
obs.time <- netcdf.calendar(nc.obs)

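## Read the coarse-scale GCM grid, shift longitudes by -360 degrees, and build
## a two-column (lon, lat) table with one row per GCM cell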
gcm.lon <- ncvar_get(nc.gcm, 'lon')-360
gcm.lat <- ncvar_get(nc.gcm, 'lat')
gcm.lats <- matrix(gcm.lat, ncol=length(gcm.lat), nrow=length(gcm.lon),
                   byrow=TRUE)
gcm.lons <- matrix(gcm.lon, ncol=length(gcm.lat), nrow=length(gcm.lon))
gcm.lons.lats <- cbind(c(gcm.lons), c(gcm.lats))

# Figure out which GCM grid boxes are associated with each fine-scale grid point
# Confine search to 10 deg. x 10 deg. neighbourhood

dxy <- 10
mdist <- function(x, y)
    apply(abs(sweep(data.matrix(y), 2, data.matrix(x), '-')), 1, sum)
nn <- list()
for (i in seq_along(obs.lons)) {
    if((i %% 500)==0) cat(i, '')
    gcm.lims <- ((gcm.lons.lats[,1] >= (obs.lons[i]-dxy)) &
                 (gcm.lons.lats[,1] <= (obs.lons[i]+dxy))) &
                ((gcm.lons.lats[,2] >= (obs.lats[i]-dxy)) &
                 (gcm.lons.lats[,2] <= (obs.lats[i]+dxy)))
    gcm.lims <- which(gcm.lims)
    nn.min <- which.min(mdist(c(obs.lons[i], obs.lats[i]),
                        gcm.lons.lats[gcm.lims,]))
    nn[[i]] <- gcm.lims[nn.min]
}
nn <- unlist(nn)


It looks like a simple algorithm. It is "easy": compute the distances, then find the minimum. But in this implementation, as the number of local cells grows, the computational cost grows as its product with the number of cells in the global grid. The Canadian ANUSPLIN data contains 1068 × 510 cells (544,680 in total). Let's assume our GCM contains 50 × 25 cells (1,250 in total). So the cost of the inner loop, in "some computational units", is:



(c0 · L × G) + (c1 · L × G) + (c2 · L × G)



where the constants c0, c1 and c2 correspond to the cost of computing the distance between two points, finding the minimum point, and looking up the array index. In fact, we don't care (much) about the constant terms, because they don't depend on the size of the input. So we can just add them up and estimate the cost of the computation as:



(c · L × G)



So for this set of inputs, our cost is 544,680 × 1,250 = 680,850,000...



680 million.
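
Just to sanity-check that figure, here is the same arithmetic in R (my own illustration; the grid sizes are the ones quoted above):

## Cost of the naive algorithm: every local cell against every global cell
L <- 1068 * 510   # fine-scale ANUSPLIN cells: 544,680
G <- 50 * 25      # coarse GCM cells: 1,250
L * G             # 680,850,000 distance computations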



That seems like a lot... but who knows? Computers are fast, right? If we run the naive implementation, it actually takes a little over 1668 seconds, which is just under half an hour.



> source('BCCA/naive.implementation.R')
500 1000 1500 2000 2500 3000 ... 543000 543500 544000 544500 [1] "Elapsed Time"
    user   system  elapsed 
1668.868    8.926 1681.728 


But should this program take half an hour to run? That's the problem. We are comparing two grids, each of which has a lot of structure that we haven't exploited at all. For example, the latitudes and longitudes in both grids are in sorted order, so to find a value you do not need to walk through the entire array. You can use bisection: look at the point in the middle, decide which half of the array you want, and repeat. Searching the whole space then takes the base-two logarithm of the size of the search space.
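
To get a feel for the difference, here is a quick back-of-the-envelope comparison in R (my own illustration, not part of the original script):

## Comparisons needed to locate one value among the 1,250 GCM cells
G <- 1250                          # cells in the global grid
linear.scan  <- G                  # the naive scan touches every cell
bisect.steps <- ceiling(log2(G))   # bisection halves the search space each step
c(linear.scan, bisect.steps)       # 1250 vs. 11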



Another important piece of structure we did not use is that latitudes vary only along one dimension of the grid (x) and longitudes only along the other (y). So instead of performing x × y operations, we can perform x + y of them. This is a huge optimization.
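
As a quick arithmetic check with our grid sizes (the same numbers as above), the difference is dramatic:

## Searching each dimension separately instead of searching the 2-D grid
nx <- 1068; ny <- 510
nx * ny   # 544,680 two-dimensional searches the naive way
nx + ny   # 1,578 one-dimensional searches when the axes are treated independently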



What does it look like in pseudocode?



  for each x in local:
    bisect_search(local[x], Global[x])

  for each y in local:
    bisect_search(local[y], Global[y])

  combine the two sets of 1-d indices into a 2-d grid of indices


In code:



## Perform a binary search on the *sorted* vector v
## Return the array index of the element closest to x
find.nearest <- function(x, v) {
    if (length(v) == 1) {
        return(1)
    }
    if (length(v) == 2) {
        return(which.min(abs(v - x)))
    }
    mid <- ceiling(length(v) / 2)
    if (x == v[mid]) {
        return(mid)
    } else if (x < v[mid]) {
        return(find.nearest(x, v[1:mid]))
    } else {
        return((mid - 1) + find.nearest(x, v[mid:length(v)]))
    }
}

regrid.one.dim <- function(coarse.points, fine.points) {
    return(sapply(fine.points, find.nearest, coarse.points))
}

## Take a fine scale (e.g. ANUSPLINE) grid of latitudes and longitudes
## and find the indices that correspond to a coarse scale (e.g. a GCM) grid
## Since the search is essentially minimizing a distance in 2 dimensions,
## we can actually search each dimension independently (which is a huge
## optimization, making the run time x + y instead of x * y) and then
## reconstruct the indices to create a full grid
regrid.coarse.to.fine <- function(coarse.lats, coarse.lons, fine.lats, fine.lons) {
    xi <- regrid.one.dim(coarse.lons, fine.lons)
    yi <- regrid.one.dim(coarse.lats, fine.lats)
    ## Two dimensional grid of indices
    xi <- matrix(xi, ncol=length(fine.lats), nrow=length(fine.lons), byrow=FALSE)
    yi <- matrix(yi, ncol=length(fine.lats), nrow=length(fine.lons), byrow=TRUE)
    return(list(xi=xi, yi=yi))
}
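
A quick way to convince yourself that find.nearest does the right thing is to try it on a small sorted vector (a toy example of my own, not from the original script):

## Which element of the sorted vector is closest to -85?
v <- c(-120, -100, -80, -60, -40)
find.nearest(-85, v)   # returns 3, since v[3] = -80 is the nearest value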


The cost of each bisection search is the log of the input size. This time our input is split into X and Y dimensions, so we'll use Gx, Gy, Lx and Ly for the global and local sizes along X and Y.



cost = Lx × log2(Gx) + Ly × log2(Gy) + Lx × Ly
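
Plugging in our grid sizes (Gx = 50, Gy = 25, Lx = 1068, Ly = 510, as above), the estimate is easy to reproduce in R:

## Estimated cost of the bisection-based algorithm
Gx <- 50; Gy <- 25        # coarse (GCM) grid dimensions
Lx <- 1068; Ly <- 510     # fine (ANUSPLIN) grid dimensions
Lx * log2(Gx) + Ly * log2(Gy) + Lx * Ly   # ~ 553,076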



The cost comes out at 553,076. Well, 553k sounds a lot better than 680m. How will this affect the runtime?



> ptm <- proc.time(); rv <- regrid.coarse.to.fine(gcm.lat, gcm.lon, obs.lat, obs.lon); print('Elapsed Time'); print(proc.time() - ptm)[1] "Elapsed Time"
   user  system elapsed 
  0.117   0.000   0.117 
> str(rv)
List of 2
 $ xi: num [1:1068, 1:510] 15 15 15 15 15 15 15 15 15 15 ...
 $ yi: num [1:1068, 1:510] 13 13 13 13 13 13 13 13 13 13 ...
> 


0.117 seconds. What used to take almost half an hour now takes just over one tenth of a second.



> 1668.868 / .117
[1] 14263.83


Sooooo... I knew it would be faster, but even I am surprised and impressed by how big the speedup is. Roughly 14 thousand times.



Previously, the script was so slow that it saved its result to disk for manual review by the scientists before continuing. Now everything is computed in the blink of an eye. We run calculations like this hundreds of times, so in the end we save days and even weeks of computing time. And we can interact with the system much more tightly, which makes better use of the scientists' working hours: they are not sitting idle waiting for a computation to finish. Everything is ready at once.



I must emphasize that these epic performance improvements required no purchase of big computer systems, no parallelization, and no added complexity... in fact, the faster code is even simpler and more general! This complete victory was achieved simply by reading the code thoughtfully and knowing a little about computational complexity.


