Transcript of "Tips and Tricks for Performance Computing with R" (43 slides)
Bryan W. Lewis, [email protected], http://goo.gl/gcPezs

Page 1

Tips and Tricks for Performance Computing with R

Bryan W. Lewis, [email protected]

http://goo.gl/gcPezs

Page 2

First tip:

Read Patrick Burns' The R Inferno: http://www.burns-stat.com/documents/books/the-r-inferno

— If you are using R and you think you're in hell, this is a map for you.

Page 3

Numeric libraries

The very heart of R

Page 4

R BLAS/LAPACK PERFORMANCE

Mac OS X is pretty good

GNU/Linux ranges from OK to poor, but is the easiest to improve and the most flexible

Windows is generally poor but easy to improve

http://cran.r-project.org/bin/windows/base/rw-FAQ.html#Can-I-use-a-fast-BLAS_003f

Page 5

Commercial BLAS/LAPACK libraries exhibit the best all-around performance.

Intel MKL -- Superb performance, but not free and can be tricky to use.

AMD ACML -- Freely available on Linux and Windows, pretty easy to use.

Page 6

We're going to install free commercial BLAS/LAPACK libraries on Windows and Linux.

Brace yourself.

It's not hard, but not pretty either.

Page 7

Installing ACML for R on Windows

1. Download and install ACML (PGI Windows version) http://developer.amd.com/amd-license-agreement/?f=acml4.4.0-win64.exe

2. Install vcredist_x64.exe: http://www.microsoft.com/en-us/download/details.aspx?id=30679 or www.microsoft.com/download/en/details.aspx?14632

3. copy e:\AMD\acml4.4.0\win64\lib\* "c:\Program Files\R\R-3.0.1\bin\x64\"

copy "c:\Program Files\R\R-3.0.1\bin\x64\Rblas.dll" "c:\Program Files\R\R-3.0.1\bin\x64\Rblas.save"

copy "c:\Program Files\R\R-3.0.1\bin\x64\libacml_dll.dll" "c:\Program Files\R\R-3.0.1\bin\x64\Rblas.dll"

4. Set the OMP_NUM_THREADS environment variable.

Page 8

Why the old library version in the previous slide?

I couldn't get newer versions to work using multiple threads on Windows (although they work fine in single-threaded mode).

If you have a lot of cores, use the old library version shown. Otherwise, you can use the latest library version in single-threaded mode and still typically achieve large performance gains.

Page 9

Installing ACML for R on Linux

Download and install ACML (GFortran Linux version) http://developer.amd.com/amd-license-agreement/?f=acml-5-3-1-gfortran-64bit.tgz

cp /opt/acml5.3.1/gfortran64_mp/lib/* /usr/local/R/lib/

cd /usr/local/R/lib

cp libRblas.so libRblas.so.backup

cp libacml_mp.so libRblas.so

Set the OMP_NUM_THREADS environment variable.

Page 10

set.seed(1)
A = matrix(rnorm(2000^2), 2000)
S = svd(A)
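To see the effect of a BLAS swap, time this snippet before and after replacing the library (a suggestion of mine; the slide shows only the code):

# Run once with the stock reference BLAS and once with the ACML-backed
# Rblas in place, then compare the elapsed times.
set.seed(1)
A = matrix(rnorm(2000^2), 2000)
print(system.time(S <- svd(A)))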

Page 11

Caveat Emptor

There is an issue with the propagation of NA and NaN values in the AMD library version for Linux shown in the previous slides.

A discussion of the issue can be found here:

http://devgurus.amd.com/thread/153983
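A quick sanity check for NA propagation (my addition, not from the slides):

# NA should propagate through matrix multiplication: with a correctly
# behaving BLAS, the first row of the product below is all NA.
A = matrix(c(NA, 1, 1, 1), 2)
B = matrix(1, 2, 2)
A %*% B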

Page 12

Vectorization

Page 13

Example: Find overlapping time intervals within each unique ID grouping in a data frame

ID  start       end
1   2001-11-26  2002-06-03
1   2002-08-30  2002-10-15
1   2002-10-07  2003-01-27
1   2003-08-27  2003-11-18
1   2004-02-11  2004-06-23
1   2004-07-23  2005-02-10
2   2003-02-24  2003-02-28
2   2003-07-11  2003-09-09
2   2004-06-26  2004-10-16
3   2002-09-15  2002-12-18

It's easy enough to spot by eye that row 2 overlaps row 3. But what about doing this automatically and efficiently on vast amounts of data?
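For reference, here is the example data as an R data frame (reconstructed from the table above):

X = data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
  start = as.Date(c("2001-11-26", "2002-08-30", "2002-10-07", "2003-08-27",
                    "2004-02-11", "2004-07-23", "2003-02-24", "2003-07-11",
                    "2004-06-26", "2002-09-15")),
  end   = as.Date(c("2002-06-03", "2002-10-15", "2003-01-27", "2003-11-18",
                    "2004-06-23", "2005-02-10", "2003-02-28", "2003-09-09",
                    "2004-10-16", "2002-12-18")))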

Page 14

f = function(X) {
  overlap = c()
  for (j in 1:(nrow(X) - 1)) {
    if (X[j, "ID"] == X[j + 1, "ID"] && X[j, "end"] > X[j + 1, "start"]) {
      overlap = c(overlap, j)
    }
  }
  overlap
}

This will work, but slowly

There are at least two performance-related problems with this approach. Do you see them? (Hint: growing 'overlap' with c() copies the vector on every append, and indexing a data frame row by row inside a loop is slow.)

Page 15

Let's try again... maybe compile it?

Unfortunately, this doesn't help much (at least on my cheap laptop).

library("compiler")cf = cmpfun(f)

Page 16

The vectorized way...

v = function(X) {
  block = c(diff(X$ID) == 0, TRUE)
  up = c(X$start, Inf)[2:(nrow(X) + 1)]
  which((up < X$end) & block)
}

Page 17

1. Start by adding the 'block' variable:

ID  start       end         block
1   2001-11-26  2002-06-03  TRUE
1   2002-08-30  2002-10-15  TRUE
1   2002-10-07  2003-01-27  TRUE
1   2003-08-27  2003-11-18  TRUE
1   2004-02-11  2004-06-23  TRUE
1   2004-07-23  2005-02-10  FALSE
2   2003-02-24  2003-02-28  TRUE
2   2003-07-11  2003-09-09  TRUE
2   2004-06-26  2004-10-16  FALSE
3   2002-09-15  2002-12-18  TRUE

Page 18

2. Shift the start column up (the 'up' column):

ID  start       up          end         block
1   2001-11-26  2002-08-30  2002-06-03  TRUE
1   2002-08-30  2002-10-07  2002-10-15  TRUE
1   2002-10-07  2003-08-27  2003-01-27  TRUE
1   2003-08-27  2004-02-11  2003-11-18  TRUE
1   2004-02-11  2004-07-23  2004-06-23  TRUE
1   2004-07-23  2003-02-24  2005-02-10  FALSE
2   2003-02-24  2003-07-11  2003-02-28  TRUE
2   2003-07-11  2004-06-26  2003-09-09  TRUE
2   2004-06-26  2002-09-15  2004-10-16  FALSE
3   2002-09-15  <NA>        2002-12-18  TRUE

Page 19

3. Compare the shifted 'up' column with the end column, subject to the block condition:

ID  start       up          end         block
1   2001-11-26  2002-08-30  2002-06-03  TRUE
1   2002-08-30  2002-10-07  2002-10-15  TRUE
1   2002-10-07  2003-08-27  2003-01-27  TRUE
1   2003-08-27  2004-02-11  2003-11-18  TRUE
1   2004-02-11  2004-07-23  2004-06-23  TRUE
1   2004-07-23  2003-02-24  2005-02-10  FALSE
2   2003-02-24  2003-07-11  2003-02-28  TRUE
2   2003-07-11  2004-06-26  2003-09-09  TRUE
2   2004-06-26  2002-09-15  2004-10-16  FALSE
3   2002-09-15  <NA>        2002-12-18  TRUE

Page 20

Vectorization performance

Tested on my slow laptop with 10,000 rows of data like the example:

For loop 6.8 seconds

Compiled for loop 6.4 seconds

Vectorized 0.009 seconds (!!)
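A sketch of how one might reproduce the comparison (my reconstruction; the slide doesn't show the benchmark code, and the data-generation scheme below is an assumption):

# Build 10,000 rows shaped like the example: IDs sorted, and intervals
# sorted by start date within each ID.
set.seed(1)
n = 10000
start = as.Date("2001-01-01") + cumsum(sample(0:30, n, replace = TRUE))
X = data.frame(ID = sort(sample(1:2000, n, replace = TRUE)),
               start = start,
               end = start + sample(1:60, n, replace = TRUE))
identical(f(X), v(X))   # both find the same overlaps
system.time(f(X))       # for loop
system.time(v(X))       # vectorized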

Page 21

Parallel computing with foreach

— Don't mourn slow-running programs... Organize them!

Page 22

Foreach defines an abstract interface to parallel computing.

Computations are performed by 'back ends' that register with foreach.

The same code works sequentially or in parallel.
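For example, registering a multicore backend (my example; any of the backends listed later in these slides would do):

# Register the doMC backend on a Unix-alike; foreach code after this
# point runs across 4 worker processes instead of sequentially.
library("doMC")
registerDoMC(cores = 4)
getDoParWorkers()   # reports 4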

Page 23

foreach(iterator, ...) %dopar% { R expression }

Page 24

library("foreach")

foreach(j=1:4) %dopar% {j}

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

Page 25

foreach(j=1:4, .combine=c) %dopar% {j}

[1] 1 2 3 4

foreach(j=1:4, .combine=`+`) %dopar% {j}

[1] 10

Here { j } is the mapped expression, and .combine specifies the reduction function.

Page 26

Foreach respects lexical scope—it just works...

z = 2
f = function(x) { sqrt(x + z) }

foreach(j=1:4, .combine=c) %dopar% { f(j) }

[1] 1.732051 2.000000 2.236068 2.449490

Foreach figures out that the mapped expression needs the definition of f and the value of z, and automatically exports them to wherever the work is being computed.

Page 27

Nesting parallel loops

Use %:% to nest foreach loops. The example creates one set of 15 tasks instead of 3, which might be better load balanced across the available resources:

foreach(x=0:2) %:%
  foreach(y=1:5, .combine=c) %dopar% { x + y }

[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 2 3 4 5 6

[[3]]
[1] 3 4 5 6 7

Page 28

List comprehension-like syntax

Use `when` to add a filter predicate

foreach(x=0:2) %:%
  foreach(y=1:5, .combine=c) %:%
    when(x < y) %dopar% { x + y }

[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 3 4 5 6

[[3]]
[1] 5 6 7

Page 29

Some available parallel backends...

doMPI
doSNOW
doMC (Unix-like OS only)
doNWS
doSMP (Windows only, maybe unmaintained now)
doDeathstar (ZeroMQ based--nifty!)
doRedis (elastic, fault-tolerant, cross-platform)

See http://goo.gl/G9VAA for a different presentation about elastic computing in R with doRedis.
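As a taste, a doRedis session might look like this (a sketch of mine; it assumes a Redis server is running locally, and the queue name "jobs" is arbitrary):

# Register a doRedis backend on a work queue and start two local workers.
library("doRedis")
registerDoRedis("jobs")
startLocalWorkers(n = 2, queue = "jobs")
foreach(j = 1:4, .combine = c) %dopar% { j }
removeQueue("jobs")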

Page 30

There exist many other superb parallel computing techniques for R.

http://cran.r-project.org/web/views/HighPerformanceComputing.html

Page 31

Know your algorithms

Page 32

Say you want a few principal components of a matrix...

set.seed(55)
A = matrix(rnorm(1000^2), 1000)
P = princomp(A)

princomp works, but computes way more than we want!

Page 33

Say you want a few principal components of a matrix...

set.seed(55)
A = matrix(rnorm(1000^2), 1000)
P = princomp(A)

library("irlba")C = scale(A, center=TRUE, scale=FALSE)P = irlba(C, nu=2, nv=2)

princomp works, but computes way more than we want!

irlba efficiently computes just what we want.
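A quick check (my addition, not on the slide) that both approaches agree:

# The leading right singular vector of the centered matrix matches
# princomp's first loading up to sign.
prc = princomp(A)
max(abs(abs(P$v[, 1]) - abs(prc$loadings[, 1])))   # approximately 0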

Page 34

Example performance, 1000x1000 matrix, two principal components (computed on my cheap laptop)

And the performance gain grows with bigger problems...

Page 35

Higher performance GLMs (well, glm.fit really)

Page 36

GLM boils down to a nonlinear optimization problem.

It's usually solved with iteratively re-weighted least squares (IRLS), an iteration something like:

beta = (X^T W X)^{-1} X^T W yhat

Where X is a model matrix, W a diagonal weight matrix updated at each iteration, beta are the model coefficients, and yhat is a quantity derived from the response vector.
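Concretely, for logistic regression one IRLS pass looks something like this (a minimal single-machine sketch of my own, not the slide's code):

# X: model matrix, y: 0/1 response. Each pass solves the weighted
# least-squares system from the iteration above.
irls_logistic = function(X, y, iters = 25) {
  beta = rep(0, ncol(X))
  for (i in 1:iters) {
    eta = drop(X %*% beta)
    mu  = 1 / (1 + exp(-eta))    # fitted probabilities
    W   = mu * (1 - mu)          # IRLS weights (diagonal of W)
    z   = eta + (y - mu) / W     # working response
    beta = solve(crossprod(X, W * X), crossprod(X, W * z))
  }
  drop(beta)
}
# Compare with: coef(glm.fit(X, y, family = binomial()))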

Page 37

IRLS can be split into a map/reduce-like parallel problem.

Let the rows of X be partitioned into blocks X_1, ..., X_p, with conforming blocks W_i and z_i of the weights and working response. Then

X^T W X = X_1^T W_1 X_1 + X_2^T W_2 X_2 + ... + X_p^T W_p X_p
X^T W z = X_1^T W_1 z_1 + X_2^T W_2 z_2 + ... + X_p^T W_p z_p

These are independent partial products that can be computed in parallel. Here, z is a value that depends on yhat from the previous slide...

Page 38

The code ends up looking something like...

# Initialize partitions IX of the model matrix X on the cluster somehow...

combiner = function(x, y) {
  list(XTWX = x$XTWX + y$XTWX,
       XTWz = x$XTWz + y$XTWz)
}
...

PX = foreach(icount(np), .combine=combiner, .inorder=FALSE) %dopar% {
  list(XTWX = crossprod(X, W[IX] * X),
       XTWz = t(crossprod(W[IX] * z[IX], X)))
}

beta = solve(PX$XTWX, PX$XTWz, tol = 2 * .Machine$double.eps)
...

Page 39

Quick example

Logistic regression

10 million observations x 200 variables

bigglm works, but takes quite a while

I just ran a quick test (not really well optimized) on a 4-computer, 32 CPU core Linux cluster. It took about 5 minutes.

Page 40

Notes...

The speedglm package almost got this right, but didn't think about parallelism.

There are some numerical stability issues to think about with this approach.

Look for code examples on http://illposed.net soon...

Page 41

Performance gems to know...

Rcpp almost turns C++ into R (!!), making C++ much nicer to use, and making it really easy to mix R and C++ ideas and code.

Bigmemory lets R work with matrices larger than available RAM. Flexmem employs a trick to let any R object exceed available RAM size.

SciDB lets R easily work in parallel on distributed arrays. SciDB can handle really big data problems (tens of terabytes or more).

Programming with Big Data in R (pbdR) defines ScaLAPACK-based distributed dense arrays and parallel methods for R. It runs on giant supercomputers at ORNL.

rmr is the most elegant R/Hadoop integration I know of.

Page 42

A few algorithm gems...

Jordan's Bag of Little Bootstraps may be one of the more important algorithms for big data mining to appear in a while. It shows that many problems can be computed in an embarrassingly parallel way (that is, partitioned into fully independent sub-problems whose answers can be combined to form a final result).
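A toy BLB run with foreach (my illustration; the parameter choices s, b, and R below are arbitrary assumptions, not from the slides):

# Estimate the standard error of the mean. Each size-b subset is
# processed independently, so the s tasks are embarrassingly parallel.
library("foreach")
blb_se = function(x, s = 10, b = floor(length(x)^0.6), R = 50) {
  n = length(x)
  ses = foreach(i = 1:s, .combine = c) %dopar% {
    xi = sample(x, b)                        # one little subset
    est = replicate(R, {
      w = drop(rmultinom(1, n, rep(1, b)))   # resample n points from b
      sum(w * xi) / n                        # weighted mean estimate
    })
    sd(est)                                  # this subset's bootstrap SE
  }
  mean(ses)                                  # combine across subsets
}
blb_se(rnorm(100000))   # close to 1/sqrt(100000)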

GLMNet is an extension of the elastic net variable selection/regularization method of Friedman, Hastie, and Tibshirani that uncovers and exploits a remarkable computational trick to gain substantial efficiency. It's super cool.

Benzi, Boito, Estrada, and others have come up with some amazingly efficient and very elegant techniques for estimating functions of huge graphs. See, for example, http://www.mathcs.emory.edu/~benzi/Web_papers/adjacency_paper.pdf.

Page 43

Tips and Tricks for Performance Computing with R

Bryan W. Lewis, [email protected]

http://goo.gl/gcPezs