Wednesday, August 10, 2011

Multiple cores in R, revisited

The bigmemory package, in combination with doMC, provides at least a partial solution for sharing a large data set across multiple cores in R. With this solution, several worker processes can operate on the same matrix without each holding its own copy. It also scales well: I've used it on files of several GB. The limitation is that all the values in the matrix need to be of the same type (typically integer).
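Because of that same-type restriction, a typical BED file has to be recoded as integers before it can be loaded. Here is a minimal sketch in base R, assuming a small toy data frame and an encoding of my own choosing (chrX as 23, chrY as 24, "+" as 1, "-" as 0); none of these names or values come from a real data set.

```r
# Toy BED-like data (hypothetical values, for illustration only)
bed <- data.frame(
  chr    = c("chr1", "chrX", "chrY", "chr2"),
  start  = c(100, 200, 300, 400),
  end    = c(150, 250, 350, 450),
  strand = c("+", "-", "+", "-"),
  stringsAsFactors = FALSE
)

# Assumed encoding: chr1..chr22 -> 1..22, chrX -> 23, chrY -> 24
chr_levels <- paste0("chr", c(1:22, "X", "Y"))

bed_int <- data.frame(
  chr    = match(bed$chr, chr_levels),   # integer position in chr_levels
  start  = as.integer(bed$start),
  end    = as.integer(bed$end),
  strand = as.integer(bed$strand == "+") # "+" -> 1, "-" -> 0
)

# Write a tab-separated, header-less file of integers
int_file <- tempfile(fileext = ".bed")
write.table(bed_int, file = int_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

The resulting file contains only integers, so it should be readable as a matrix of type 'integer'.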

The following code reads in a bed-like file with numerical values for chromosome (1:24) and strand (1,0), processes the file in parallel, one chromosome per task, and returns the results as a list. Note the use of the descriptor to identify the shared object. Any change to the shared object is immediately visible to all processes.

library(bigmemory)
library(doMC)
registerDoMC(cores=24)


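# filename: path to the tab-separated input file
# shared=TRUE so that worker processes can attach to the matrix
bigtab <- read.big.matrix(filename, sep="\t", col.names=c('chr','start','end','strand'),
                          type='integer', shared=TRUE)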

descriptor <- describe(bigtab)

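result <- foreach(chr = seq(1,24)) %dopar% {
  # Each worker attaches to the shared matrix through its descriptor
  tab <- attach.big.matrix(descriptor)
  # Rows for this chromosome; drop=FALSE keeps a matrix even for a single row
  tab.chr <- tab[tab[,'chr'] == chr, , drop=FALSE]
  # Do some stuff with these values
  # and return result
}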
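To see the descriptor mechanism in isolation, here is a small sketch, assuming a toy 4 x 2 matrix rather than real data. It attaches a second handle within the same R session, which is the same thing each worker does above; the names m, desc and m2 are made up for this example.

```r
library(bigmemory)

# Small shared matrix standing in for the real data (toy example)
m <- big.matrix(nrow = 4, ncol = 2, type = "integer", init = 0, shared = TRUE)

# The descriptor is a small object describing where the shared data lives
desc <- describe(m)

# A second handle to the same memory, obtained the way a worker would
m2 <- attach.big.matrix(desc)

# Write through the second handle ...
m2[1, 2] <- 42L

# ... and read through the first one: it should reflect the write
m[1, 2]
```

Only the descriptor needs to reach the workers, not the data itself, which is why this approach stays cheap even for large matrices.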


Cool, huh?
