R Code Optimization

Katia Bulekova

2022-03-29

We will be using two packages:

library(microbenchmark)
library(profvis)

Timing of expression evaluation

Suppose we would like to measure the time it takes to run a particular line or lines within a script. We can use the system.time() function:

system.time( mad( runif(1000) ) )
#>    user  system elapsed 
#>   0.001   0.000   0.001

The problem with this approach is that some functions run too fast to measure reliably, so we need to repeat them many times to see non-zero values in the output:

system.time( for(i in 1:100) mad( runif(1000) ) )
#>    user  system elapsed 
#>   0.029   0.005   0.034

The user time is the CPU time charged for the execution of user instructions of the calling process. It includes the time needed for computation.

The system time is the CPU time charged for execution by the system on behalf of the calling process. It includes the time needed for memory allocation, accessing hardware (network), reading or writing operations, etc. These operations are usually very time consuming and need to be minimized and used with care.

The elapsed time is wall clock time - time from start to finish of the call.
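
For example, a call that merely waits spends almost no CPU time, so the user and system times stay near zero while the elapsed time reflects the full wait. A minimal illustration using Sys.sleep():

system.time( Sys.sleep(1) )   # user and system ~0, elapsed ~1 second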

If we are interested in measuring the elapsed time (and possibly comparing it with another implementation of the same task), a better approach is to use the microbenchmark() function from a package with the same name.
Note: by default, the microbenchmark() function runs each expression a hundred times.

microbenchmark( mad( runif(1000) ), times=100 )
#> Unit: microseconds
#>              expr     min       lq     mean   median     uq     max neval
#>  mad(runif(1000)) 150.111 162.4395 167.2235 165.6895 169.53 260.035   100

R Loops: Are they really bad?

You may read or hear that R loops are very slow. Indeed, the following loop seems to take forever to perform a very simple operation:

system.time({
  x <- c()
  for (i in 1:1e5) x <- c( x, i )
})
#>    user  system elapsed 
#>  12.080   0.160  12.241

However, the real reason it takes so long to compute x here is the dynamic growing of x. Let’s modify our code a little and print x after each iteration:

x <- c()
for (i in 1:7) {
  x <- c( x, i )
  
  print( paste("Iteration", i, " x: ", paste(x, collapse=", ")))
}
#> [1] "Iteration 1  x:  1"
#> [1] "Iteration 2  x:  1, 2"
#> [1] "Iteration 3  x:  1, 2, 3"
#> [1] "Iteration 4  x:  1, 2, 3, 4"
#> [1] "Iteration 5  x:  1, 2, 3, 4, 5"
#> [1] "Iteration 6  x:  1, 2, 3, 4, 5, 6"
#> [1] "Iteration 7  x:  1, 2, 3, 4, 5, 6, 7"

During each iteration, R has to allocate a new, larger chunk of memory, copy the previous value of x, and then de-allocate the old memory. Operations with memory (especially memory allocation) are very time-consuming and should be minimized.

Let’s allocate the memory first and then fill out the array within the loop:

x <- numeric(7)
for (i in 1:7) {
  x[i] <- i
  
  print( paste("Iteration", i, " x: ", paste(x, collapse=", ")))
}
#> [1] "Iteration 1  x:  1, 0, 0, 0, 0, 0, 0"
#> [1] "Iteration 2  x:  1, 2, 0, 0, 0, 0, 0"
#> [1] "Iteration 3  x:  1, 2, 3, 0, 0, 0, 0"
#> [1] "Iteration 4  x:  1, 2, 3, 4, 0, 0, 0"
#> [1] "Iteration 5  x:  1, 2, 3, 4, 5, 0, 0"
#> [1] "Iteration 6  x:  1, 2, 3, 4, 5, 6, 0"
#> [1] "Iteration 7  x:  1, 2, 3, 4, 5, 6, 7"

Measuring the time to fill out the 1e5-long x vector:

system.time({
  x <- numeric(1e5)
  for (i in 1:1e5) {
    x[i] <- i
  }
})
#>    user  system elapsed 
#>   0.007   0.000   0.007

The same situation occurs when we use rbind() within a loop to add additional rows to a matrix or a data frame:

# slow!
matr <- NULL
system.time( for(i in seq(1e4)) matr <- rbind(matr, 1:10) )
#>    user  system elapsed 
#>   0.745   0.036   0.781
# better
matr <- matrix(NA, nrow=1e4, ncol=10)
system.time( for(i in seq(1e4)) matr[i,] <- 1:10 )
#>    user  system elapsed 
#>   0.006   0.000   0.005

As we can see, the main problem was in the code, not in the loop! That said, a loop does have some overhead, and whenever possible we should vectorize our code; still, a loop can be used safely when appropriate.
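
Incidentally, the fill loop above is shown only for illustration: in real code this particular vector can be produced directly with a single vectorized call:

x <- seq_len(1e5)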

Vectorization

Below is a simple case of taking advantage of R vectorization. We compare two approaches to computing the logarithm of each element of a vector and then adding all the elements together:

x <- rnorm(1e4, 100, 2)

microbenchmark({
  # Slow method:
  total <- 0
  for(i in seq_along(x)) total <- total + log(x[i])
},

{
  # Faster:
  log_sum <- sum(log(x))
})
#> Unit: microseconds
#>                                                                       expr
#>  {     total <- 0     for (i in seq_along(x)) total <- total + log(x[i]) }
#>                                             {     log_sum <- sum(log(x)) }
#>       min        lq      mean    median        uq      max neval cld
#>  2871.843 2928.1920 3171.2189 2971.4760 3333.7445 8537.560   100   b
#>   314.734  319.7805  335.6021  323.7565  357.1405  366.385   100  a

Let’s list the operations R needs to perform using the loop approach:

  1. It has to call log() 1e4 times and pass input parameters.
  2. It needs to allocate memory 1e4 times to store the resulting values.
  3. It needs to execute the addition and then perform the assignment operation also 1e4 times.

In contrast, the vectorized approach:

  1. Calls log() function only once.
  2. Allocates memory to store the result only once (though for a much larger object).
  3. Calls sum() function once and performs the final assignment operation only once.
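
As a sanity check (assuming the total and log_sum assignments from the benchmark above persist in your workspace), the two approaches should agree up to floating-point tolerance:

all.equal(total, log_sum)   # should be TRUE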

Similarly, we should avoid using an if() statement within a loop to compute one vector from another. The ifelse() function performs the same task much more efficiently:

x <- rnorm(1e5,0,1)
vec <- numeric(1e5)

microbenchmark(
  # Slow method: using if() within a loop 
  "if:" = ( for(i in 1:length(vec)) if(x[i] < 0) vec[i] <- -1 else 1),

  # Faster method: use ifelse() function
  "ifelse:" = ( vec <-ifelse (x < 0, -1, 1 ) )
)
#> Unit: milliseconds
#>     expr      min       lq     mean   median      uq      max neval cld
#>      if: 8.998399 9.266198 9.427125 9.352911 9.45562 12.21461   100   b
#>  ifelse: 2.332400 2.457369 3.378499 2.496164 4.11170 18.29508   100  a
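
Note that ifelse() is not the fastest option either. Here is a sketch using vectorized logical subsetting, which is often faster still for a simple recoding like this one:

vec <- rep(1, length(x))   # start with the "else" value
vec[x < 0] <- -1           # overwrite where the condition holds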

Tip: seq_along() is generally faster and safer to use than 1:length(x):

x <- rnorm(1e5)
microbenchmark( 1:length(x), seq_along(x) )
#> Unit: nanoseconds
#>          expr min    lq   mean median    uq  max neval cld
#>   1:length(x) 224 240.5 336.85  257.5 332.0 3705   100   b
#>  seq_along(x) 116 134.5 188.71  154.5 188.5 1856   100  a

In cases when the input x vector has zero length (as a result of a bug or an incorrect input), seq_along() will do the right thing:

x <- NULL

# Using 1:length(), the loop will execute two times!
for ( i in 1:length(x) ) print(i)
#> [1] 1
#> [1] 0
# Using seq_along() the loop will execute 0 times
for ( i in seq_along(x) ) print(i)

Random Numbers and vectorization

Random number generation is a relatively slow operation. In this example we create random numbers one at a time:

slow <- function(){
  vals <- numeric(1e5)   # pre-allocate the result vector
  for(i in 1:1e5) vals[i] <- rnorm(1)
}

And here we generate random numbers using a single rnorm function call:

fast <- function(){
  vals <- rnorm(1e5)
}

Let’s use the microbenchmark() function to compare these two approaches:

microbenchmark(slow(), fast(), times=10 )
#> Unit: milliseconds
#>    expr        min         lq       mean     median         uq       max neval
#>  slow() 204.492434 205.735406 219.102406 207.380764 210.769643 321.73002    10
#>  fast()   6.222917   6.232032   6.387098   6.238346   6.253368   7.72048    10
#>  cld
#>    b
#>   a

A similar approach should be used when working with other random-sampling functions, including sample(), as sketched below.
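
For instance, here is a sketch of the same pattern with sample() (slow_sample() and fast_sample() are illustrative names, not functions from any package):

slow_sample <- function(){
  vals <- numeric(1e5)
  for(i in 1:1e5) vals[i] <- sample(10, 1)   # one draw per call
  vals
}

fast_sample <- function(){
  sample(10, 1e5, replace = TRUE)            # all draws in one call
}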

Note: There is an R package, dqrng, that might be considered for faster random number generation:


microbenchmark(slow(), fast(), "dqrng"=dqrng::dqrnorm(1e5), times=10)
#> Unit: microseconds
#>    expr        min         lq       mean      median         uq        max
#>  slow() 205924.259 207621.974 208790.460 208738.3230 210418.033 211231.077
#>  fast()   6188.793   6238.763   6250.755   6246.2280   6263.155   6315.276
#>   dqrng    804.661    815.949   9780.719    845.5415    855.673  90288.405
#>  neval cld
#>     10   b
#>     10  a 
#>     10  a

Large matrix multiplication

When performing matrix multiplication, consider using parentheses to minimize the number of computations the computer needs to perform. Matrix multiplication is associative, so the grouping of a matrix-matrix-vector product can be changed without affecting the result:

# matrix A * matrix B * vector v
A <- matrix(rnorm(10000), nrow = 100)
B <- matrix(rnorm(10000), nrow = 100)
v <- rnorm(100)
microbenchmark(
  result1 <- A %*% B %*% v, 
  result2 <- A %*% (B %*% v) )
#> Unit: microseconds
#>                        expr     min      lq      mean   median       uq     max
#>    result1 <- A %*% B %*% v 258.237 260.592 264.35291 264.0890 266.8995 276.933
#>  result2 <- A %*% (B %*% v)  28.264  28.842  29.34891  29.0685  29.2440  43.812
#>  neval cld
#>    100   b
#>    100  a

# Check if results are equal
all.equal (result1, result2)
#> [1] TRUE

In the first case, we first compute the matrix-matrix product, which costs about 100^3 = 1e6 scalar multiplications and produces a matrix; this matrix is then multiplied by the vector (another 100^2 = 1e4 multiplications). In the second case, we perform two matrix-vector products, each costing only 100^2 = 1e4 multiplications, reducing the total work from roughly 1.01e6 to 2e4 multiplications.

When appropriate, use crossprod() and tcrossprod() for computing cross products: they are slightly faster than t(x) %*% y (crossprod) and x %*% t(y) (tcrossprod), respectively.
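
For example, a quick comparison sketch (actual timings depend on your system and BLAS library):

A <- matrix(rnorm(1e4), nrow = 100)
microbenchmark(
  t(A) %*% A,
  crossprod(A)    # computes t(A) %*% A
)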

Computing matrix and data.frame row means and sums

# Define a matrix and a data.frame
matr <- matrix(runif(1e4), nrow=100)

df <- as.data.frame(matr)

R provides two optimized functions, rowMeans() and rowSums(), to calculate row means and sums of a matrix or a data frame:

microbenchmark( 
  apply(matr, 1, mean),
  rowMeans(matr) )
#> Unit: microseconds
#>                  expr     min       lq      mean   median       uq     max
#>  apply(matr, 1, mean) 570.057 580.6060 589.59984 587.9365 595.4905 710.220
#>        rowMeans(matr)  24.747  25.5735  26.89985  27.2620  27.7635  36.419
#>  neval cld
#>    100   b
#>    100  a

Similarly, for a data frame the built-in rowMeans() is significantly faster:

microbenchmark( 
  apply(df, 1, mean),
  rowMeans(df) )
#> Unit: microseconds
#>                expr      min       lq      mean    median       uq        max
#>  apply(df, 1, mean) 1153.350 1166.525 2580.0438 1193.2945 1258.560 134939.266
#>        rowMeans(df)  465.388  481.476  523.4131  509.6755  521.514    693.704
#>  neval cld
#>    100   a
#>    100   a

We see a similar time improvement with rowSums():

microbenchmark( 
  apply(df, 1, sum),
  rowSums(df) )
#> Unit: microseconds
#>               expr     min       lq     mean   median       uq      max neval
#>  apply(df, 1, sum) 856.596 881.8645 977.7758 895.1445 927.1920 8385.386   100
#>        rowSums(df) 461.338 476.5985 493.4157 487.2330 513.5875  564.414   100
#>  cld
#>    b
#>   a

Computing matrix and data.frame column means and sums

When we need to work with columns, colMeans() and colSums() provide the same time improvement for matrices. However, when we work with a data frame, we should consider using the lapply() function:

matr <- matrix(runif(1e4), ncol=100)
df <- as.data.frame(matr)

microbenchmark( 
  apply(df, 2, mean),
  colMeans(df),
  lapply(df, mean),
  times=10 
)
#> Unit: microseconds
#>                expr      min       lq      mean    median       uq      max
#>  apply(df, 2, mean) 1066.836 1077.395 1101.7095 1085.0135 1109.214 1232.229
#>        colMeans(df)  458.012  466.079  486.0774  470.7885  477.485  628.296
#>    lapply(df, mean)  278.574  283.286  287.4123  285.4715  293.646  298.747
#>  neval cld
#>     10   c
#>     10  b 
#>     10 a
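
Note that lapply() returns a list. If a plain numeric vector is needed, here is a sketch using vapply(), which also checks the type of each result:

col_means <- vapply(df, mean, numeric(1))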

Data Frames vs. Matrices

When working with a data frame, use column names, when possible, to access columns or individual values.

microbenchmark(
  "[50, 3]" = df[50,3],
  "$V3[50]" = df$V3[50]
)
#> Unit: nanoseconds
#>     expr   min      lq     mean  median      uq   max neval cld
#>  [50, 3] 11107 11761.5 12669.46 12035.5 12529.5 52854   100   b
#>  $V3[50]   774  1066.0  1352.95  1233.0  1435.5 12708   100  a

A data frame in R is also a list: a list of columns. Each column is stored separately in memory. The most efficient way to extract a column from a data frame is by using the column name:

microbenchmark(matr[ ,7], df[ ,7], df$V7)
#> Unit: nanoseconds
#>       expr  min     lq    mean median     uq   max neval cld
#>  matr[, 7]  942 1296.0 1494.64 1433.5 1575.5  6616   100  a 
#>    df[, 7] 6715 7574.5 8540.65 8047.5 8533.0 43265   100   b
#>      df$V7  747  938.5 1130.54 1080.0 1230.5  5087   100  a

At the same time, accessing rows in a data frame is a very costly operation, as it requires extracting an element from each column and then creating a new data frame with a single row. Elements of a matrix, in contrast, are stored contiguously in memory as a single long vector, so it is very fast to gather the elements of a row and create an output vector.

microbenchmark(matr[7, ], df[7, ])
#> Unit: microseconds
#>       expr     min       lq      mean   median       uq     max neval cld
#>  matr[7, ]   1.253   2.1650   2.64029   2.7290   2.9725   6.076   100  a 
#>    df[7, ] 560.138 574.3515 590.34775 579.5385 595.5235 664.841   100   b

If you are working with all-numeric data, a matrix is a faster R object to use than a data.frame: creating a data.frame and accessing its rows takes significantly more time than the equivalent matrix operations.

# use a data.frame to return the result of a function
d <- function() {
  data.frame(
    col1 = sample(1:10, 30, replace = TRUE)
  )
}

# use a matrix to return the result of a function
m <- function() {
  matrix(sample(1:10, 30, replace = TRUE), nrow=30)
}

# compare 2 approaches
microbenchmark(
  df = d(),
  mat = m()
)
#> Unit: microseconds
#>  expr     min      lq      mean   median       uq      max neval cld
#>    df 130.034 134.467 155.33058 135.7825 138.6015 1788.198   100   b
#>   mat   7.121   8.341  30.29864   9.7170  13.3075 1960.421   100  a

pmax() and pmin() vs. max() and min()

max and min return the maximum or minimum of all the values present in their arguments.

pmax and pmin take one or more vectors (or matrices) as arguments and return a single vector giving the parallel maxima (or minima) of the vectors. The first element of the result is the maximum (minimum) of the first elements of all the arguments, the second element of the result is the maximum (minimum) of the second elements of all the arguments and so on:

x <- c(4, 19, 23, 3, 20)
y <- c(9, 24, 21, 4, NA)
z <- pmax(x, y, 5) # vector as long as the longer of x and y,
                   # where z[i] is the max of x[i], y[i], and 5
z
#> [1]  9 24 23  5 NA
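
If the NA values should be skipped instead, pmax() and pmin() accept an na.rm argument:

pmax(x, y, 5, na.rm = TRUE)
#> [1]  9 24 23  5 20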

When we need to compute row maxima or minima for a matrix or a data frame, the best approach depends on the shape of the data.

If a data frame is “long” (it has relatively few columns and many more rows), pmin() and pmax() called through do.call() are much more efficient than apply():


library(matrixStats)
x <- matrix(rnorm(1e5,0,1),ncol=10 )
df <- as.data.frame(x)

microbenchmark( 
  apply(df, 1, max),
  do.call( pmax, df) , 
  rowMaxs(as.matrix(df)),   # from matrixStats package
  times = 3)
#> Unit: microseconds
#>                    expr       min        lq      mean    median        uq
#>       apply(df, 1, max) 16939.440 17061.514 17148.314 17183.587 17252.751
#>       do.call(pmax, df)   781.224   785.802   787.852   790.380   791.166
#>  rowMaxs(as.matrix(df))   964.132   972.653  1005.031   981.174  1025.481
#>        max neval cld
#>  17321.915     3   b
#>    791.952     3  a 
#>   1069.787     3  a

However, if the data frame has a large number of columns (is “wide”), then pmax() and pmin() lose much of their advantage for computing row maxima and minima:


x <- matrix(rnorm(1e7,0,1),nrow=10 )
df <- as.data.frame(x)

microbenchmark( 
  apply(df, 1, max),
  do.call( pmax, df) , 
  times = 3)
#> Unit: seconds
#>               expr      min       lq     mean   median       uq      max neval
#>  apply(df, 1, max) 5.929419 6.316609 6.627789 6.703799 6.976974 7.250150     3
#>  do.call(pmax, df) 2.648240 3.067606 3.289333 3.486972 3.609880 3.732787     3
#>  cld
#>    b
#>   a
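
For wide data like this, converting to a matrix and using rowMaxs() from the matrixStats package (used above) may also be worth benchmarking; a sketch:

microbenchmark( 
  do.call( pmax, df),
  rowMaxs(as.matrix(df)),   # from matrixStats package
  times = 3)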

Subsetting the for loop sequence with a [condition]

Consider the following data:

df <- data.frame(age = runif(10^3, min=0, max=100),
                 body.temp = runif(10^3, min=35.5, max = 42),
                 bmi = rnorm(10^3, mean=24, sd=2)
                 )

We want to compute a result vector: for people with a BMI greater than 27 we perform some analysis, while for the others we set the value to -1:

#a slow method
method1 <- function(){
  
  # Allocate the memory and set all values to be equal to -1
  result <- rep( -1, nrow(df))
  
  # Loop through all rows of df and compute result
  for ( i in 1:nrow(df) ){
    
    if ( df$bmi[i] > 27 ){
      
      result[i] <- if ( df$age[i] > 18 )
        mean(rnorm(1000, mean=df$body.temp[i])) else
        median(rnorm(1000, mean=df$body.temp[i]))
      
    }
  }
  result
}

We can first compute a logical vector that is TRUE for those rows where the BMI value is greater than 27. Then we execute the loop only for the iterations where the condition is TRUE:

#a faster method
method2 <- function(){
  
  # Allocate the memory and set all values to be equal to -1
  result <- rep( -1, nrow(df))
  condition <- df$bmi > 27
  
  # Loop only through the rows of df where the condition is TRUE
  for ( i in (1:nrow(df))[condition] ){
    
    result[i] <- if ( df$age[i] > 18 )
      mean(rnorm(1000, mean=df$body.temp[i])) else
      median(rnorm(1000, mean=df$body.temp[i]))

  }
  result
}

Let’s compare two approaches:

microbenchmark(method1(), method2())
#> Unit: nanoseconds
#>       expr     min        lq       mean  median      uq      max neval cld
#>  method1() 7120937 7146574.0 7267934.94 7159550 7169243 17865592   100   b
#>  method2() 1952371 1964580.0 2068049.88 1971768 1978381 11443255   100  a

Keep it simple

Make your code clear and simple. Explore functions' arguments and documentation. Avoid extra calculations and setting unnecessary attributes.

For example, the unlist() function by default returns a named vector. In many cases, we are interested in the values only, and setting use.names=FALSE will result in much better performance:

# Create a list
my.list <- split(sample(1000, size = 100000, replace = TRUE), rep(1:10000, times = 10))

# Use unlist function to convert list to a vector
microbenchmark(
  v <- unlist(my.list), 
  v <- unlist(my.list, use.names = FALSE) )
#> Unit: microseconds
#>                                     expr       min         lq       mean
#>                     v <- unlist(my.list) 29126.201 29151.1735 29382.2819
#>  v <- unlist(my.list, use.names = FALSE)   252.199   255.9045   260.1722
#>      median        uq       max neval cld
#>  29167.0950 29190.377 41972.032   100   b
#>    258.4715   262.078   300.894   100  a

R code profiling

x <- matrix(rnorm(10000000), nrow=1000)
x[sample(1:10000000, 1000, replace=FALSE)] <- NA

slowFun <- function(x) {
  # initialize res
  res <- NULL
  n <- nrow(x)
  
  # grow res by appending every complete row (deliberately slow)
  for (i in 1:n) {
    if (!any(is.na(x[i,]))) res <- rbind(res, x[i,])
  }
  apply(res, 1, mean)
}

# Start profiler
Rprof(interval = 0.01)

  # run code
   x.nomiss <- slowFun(x)
   
# end profiler
Rprof(NULL)


# view results
summaryRprof()
#> $by.self
#>                 self.time self.pct total.time total.pct
#> "rbind"             10.15    94.33      10.15     94.33
#> "slowFun"            0.25     2.32      10.76    100.00
#> "aperm.default"      0.18     1.67       0.18      1.67
#> "apply"              0.12     1.12       0.30      2.79
#> "is.na"              0.06     0.56       0.06      0.56
#> 
#> $by.total
#>                       total.time total.pct self.time self.pct
#> "slowFun"                  10.76    100.00      0.25     2.32
#> "block_exec"               10.76    100.00      0.00     0.00
#> "call_block"               10.76    100.00      0.00     0.00
#> "eng_r"                    10.76    100.00      0.00     0.00
#> "eval"                     10.76    100.00      0.00     0.00
#> "evaluate_call"            10.76    100.00      0.00     0.00
#> "evaluate::evaluate"       10.76    100.00      0.00     0.00
#> "evaluate"                 10.76    100.00      0.00     0.00
#> "handle"                   10.76    100.00      0.00     0.00
#> "in_dir"                   10.76    100.00      0.00     0.00
#> "knitr::knit"              10.76    100.00      0.00     0.00
#> "process_file"             10.76    100.00      0.00     0.00
#> "process_group.block"      10.76    100.00      0.00     0.00
#> "process_group"            10.76    100.00      0.00     0.00
#> "rmarkdown::render"        10.76    100.00      0.00     0.00
#> "timing_fn"                10.76    100.00      0.00     0.00
#> "withCallingHandlers"      10.76    100.00      0.00     0.00
#> "withVisible"              10.76    100.00      0.00     0.00
#> "rbind"                    10.15     94.33     10.15    94.33
#> "apply"                     0.30      2.79      0.12     1.12
#> "aperm.default"             0.18      1.67      0.18     1.67
#> "aperm"                     0.18      1.67      0.00     0.00
#> "is.na"                     0.06      0.56      0.06     0.56
#> 
#> $sample.interval
#> [1] 0.01
#> 
#> $sampling.time
#> [1] 10.76
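
The profiler output points at rbind() as the bottleneck. A vectorized rewrite, sketched below (fastFun is a hypothetical name), avoids growing res altogether by using complete.cases() and rowMeans():

fastFun <- function(x) {
  # keep only the rows without missing values, then average each row
  rowMeans(x[complete.cases(x), , drop = FALSE])
}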

Use the profvis package to explore the bottlenecks in your code interactively.
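
A minimal sketch of its use on the function above (profvis was loaded at the top of this document):

profvis({
  x.nomiss <- slowFun(x)
})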

Code guidelines

A number of good suggestions for making your code clean and easy to read can be found on the CRAN and Bioconductor websites.