Very gentle resource for speeding up R code

Nathan Uyttendaele has written a great beginner’s guide to speeding up your R code. Abstract:

Most calculations performed by the average R user are unremarkable in the sense that nowadays, any computer can crush the related code in a matter of seconds. But more and more often, heavy calculations are also performed using R, something especially true in some fields such as statistics. The user then faces total execution times of his codes that are hard to work with: hours, days, even weeks. In this paper, how to reduce the total execution time of various codes will be shown and typical bottlenecks will be discussed. As a last resort, how to run your code on a cluster of computers (most workplaces have one) in order to make use of a larger processing power than the one available on an average computer will also be discussed through two examples.

Unlike many similar guides I’ve seen, this really is aimed at a computing novice. You don’t need to be a master of the command line or a Linux expert (Windows and Mac are addressed too). You are walked through installation of helpful non-R software. There’s even a nice summary of how hardware (hard drives vs RAM vs CPU) all interact to affect your code’s speed. The whole thing is 60 pages, but it’s a quick read, and even just skimming it will probably benefit you.

Favorite parts:

  • “The strategy of opening R several times and of breaking down the calculations across these different R instances in order to use more than one core at the same time will also be explored (this strategy is very effective!)” I’d never realized this is possible. He gives some nice advice on how to do it with a small number of R instances (sort of “by hand,” but semi-automated).
  • I knew about rm(myLargeObject), but not about needing to run gc() afterwards.
  • I haven’t used Rprof before, but now I will.
  • There’s helpful advice on how to get started combining C code with R under Windows—including what to install and how to set up the computer.
  • The doSMP package sounds great — too bad it’s been removed 🙁 but I should practice using the parallel and snow packages.
  • P. 63 has a helpful list of questions to ask when you’re ready to learn to use your local cluster.
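To make that first bullet concrete, here is a minimal sketch (mine, not from the paper) of farming tasks out to several cores with base R’s parallel package; `heavy_task()` is a toy stand-in for a real calculation:

```r
library(parallel)

# Toy stand-in for a heavy calculation (e.g. one bootstrap replicate)
heavy_task <- function(seed) {
  set.seed(seed)
  mean(rnorm(1e5))
}

n_cores <- 2  # in practice: parallel::detectCores()

if (.Platform$OS.type == "unix") {
  # Linux/Mac: fork the current R session into worker processes
  results <- mclapply(1:8, heavy_task, mc.cores = n_cores)
} else {
  # Windows: launch separate R instances and send tasks to them;
  # essentially the "open R several times" strategy, automated
  cl <- makeCluster(n_cores)
  results <- parLapply(cl, 1:8, heavy_task)
  stopCluster(cl)
}

length(results)  # one result per task
```

Since the parallel package ships with R itself, this also fills the gap left by doSMP’s removal.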

One thing Uyttendaele could have mentioned, but didn’t, is the use of databases and SQL. These can be used to store really big datasets and pass small pieces of them into R efficiently, instead of loading the whole dataset into RAM at once. Anthony Damico recommends the column-store database system MonetDB and has a nice introduction to using MonetDB with survey data in R.
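As a rough sketch of that workflow through R’s DBI interface (here RSQLite’s in-memory database stands in for MonetDB, which MonetDB.R exposes the same way; the `survey` table is made up):

```r
library(DBI)
library(RSQLite)  # MonetDB.R exposes the same DBI interface

# Pretend this table is too big to load whole
set.seed(1)
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "survey",
             data.frame(id  = 1:10000,
                        age = sample(18:90, 10000, replace = TRUE)))

# Let the database do the filtering; only the small result enters RAM
young <- dbGetQuery(con, "SELECT id, age FROM survey WHERE age < 25")
head(young)

dbDisconnect(con)
```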

10 responses to “Very gentle resource for speeding up R code”

  1. It’s almost never necessary to run gc() – R automatically runs it when needed.

  2. Just skimmed it a bit more thoroughly. The basic message is ok, but a lot of the details are wrong:

    * Looking at the memory that the OS allocates to R is not a good indicator of what’s going on.

    * Don’t use Rprof, use a friendlier wrapper around it (e.g. lineprof)

    * Don’t use the C api, use Rcpp

    * Don’t mess around with path environment variables, use devtools or RStudio which set it for you automatically.

    * Don’t open multiple R instances, use mclapply()

    • Thanks for pointing those out! To be fair, some of those aren’t “wrong” so much as “there’s an even friendlier tool for it in the Hadleyverse” 🙂

      As for Rcpp, I know it’s much nicer than the C API when you’re starting from scratch. But if you already have substantial legacy code in C that you want to connect to R, I assume you need the C API after all. Or can Rcpp handle C code too?

    • Be nice, Hadley: the details you picked from my paper are not wrong; it’s just that, as civilstat pointed out (thanks!), some of those aren’t “wrong” so much as “there’s an even friendlier tool for it in the Hadleyverse”.

      * Looking at the memory that the OS allocates to R is not a good indicator of what’s going on.
      >> I’m sure there is a better way of doing this. Yet monitoring the memory usage of an R instance using the Task Manager (Windows) or Activity Monitor (Mac) yields good enough results on my side and is easy to explain to my students.

      * Don’t use Rprof, use a friendlier wrapper around it (e.g. lineprof)
      >> That does not mean I’m “wrong” if I don’t use that wrapper.

      * Don’t use the C api, use Rcpp
      >> I actually did not know about Rcpp until recently; I have received many emails about it since I made my paper available online. I promise to edit my paper accordingly in the near future.

      * Don’t mess around with path environment variables, use devtools or RStudio which set it for you automatically.
      >> Again, it does not mean I’m wrong if I do it as shown in my paper.

      * Don’t open multiple R instances, use mclapply()
      >> I’ll add mclapply to the paper in the near future. However, I don’t understand what’s wrong with using multiple R instances.

  3. C++ is a superset of C, so it’s fine to use Rcpp with existing C code. I’ve just written bindings to a whole bunch of existing C libraries (libxml, libpq, libmysql, sqlite, …) and can testify that it works 🙂
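    For readers curious what that pattern looks like, here is a minimal sketch of compiling C-style code through Rcpp (`addOne()` is a made-up example, not one of the library bindings mentioned above):

```r
library(Rcpp)

# C-compatible code, compiled by Rcpp's C++ toolchain
cppFunction('
int addOne(int x) {
  return x + 1;
}')

addOne(41)  # 42
```

    Note that this needs a working compiler toolchain (on Windows, Rtools — the setup the paper walks through).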

  4. * Don’t use Rcpp when vectorizing part of the code is trivial

    ksmooth2 <- function(data, xpts, h) {
      dens <- double(length(xpts))
      n <- length(data)
      for (i in 1:length(xpts)) {
        d <- xpts[i] - data
        ksum <- sum(dnorm(d / h))
        dens[i] <- ksum / (n * h)
      }
      dens
    }

    system.time({
      fx_estimated1 <- ksmooth2(data, xpts, h)
    })[[3]]
    [1] 0.08

    • Of course you are right, but the point of the example you picked from the paper was merely to show how to turn some R code into C code. I’ll add a note in the near future so that people know that vectorizing part of the code is very effective in this case, too.

  5. Looks like that was garbled. The gist is that if you vectorize the inner loop (something the author himself just suggested), you get speeds comparable to the C code, and future you will be happier.

    On my system ksmooth1 took 15.78 seconds and ksmooth2 took 0.08 seconds.

    # using equals instead of the arrow, to avoid the HTML issues above
    # this is the rewrite of the inner loop
    d = xpts[i] - data
    ksum = sum(dnorm(d / h))
    dens[i] = ksum / (n * h)