Sunday, March 20, 2011

Fast(ish) extraction of exon locations from a BED12 file using data.table

Here is a fast R function to extract exon locations from a BED12 file. Note that fast is a relative term, the function below is fast enough for me, may not be fast enough for others :) Anyway, a BED12 file typically has locations of genomic features (those features are usually genes or transcripts if the format is BED12 ). 11th and 12th columns of the BED12 file contains information about exon sizes and exon starts, see here for more info on BED format. To be able to use that information one has to employ string functions since exon sizes and starts are concatenated to each other with commas. A straightforward way to do is to use strplit() and apply() to extract numbers located in strings, but it takes too much time if you have a large number of rows.

By using data.table I could get locations of exons for all human Ensembl transcripts in 5-6 seconds, while using apply() and strsplit() on a regular data.frame takes about 30 seconds ( see the EDIT below, this is probably because I used apply in an inefficient way, it may not be needed at all ). The function returns a data.table object which then can be transformed into IRanges or GenomicRanges objects easily. Doing the string operations and iteration in C or C++ may decrease this time even further.

EDIT: I'm not sure why this is particularly faster than a data.frame, I guess I used apply in an idiotic way on the BED data.frame. I applied strsplit row-by-row on 11th column of each row, rather than using strsplit directly on 11th column of the data.frame. Anyway, it was a good exercise but a bit useless :)

bed12.to.exons<-function(bed.file,skip=0)
{
  require(data.table)
  colClasses=c("factor","integer","integer","factor","integer","factor","integer","integer","integer","integer","factor","factor") 
  ref   =read.table(bed.file,header=F,skip=skip,colClasses = colClasses) # give colClasses for fast read-in
  ref.dt=data.table(ref)
 
  # apply strsplit on columns 11 and 12 to extract exon start positions and exon sizes
  b.start.size=ref.dt[, cbind(as.integer(unlist(strsplit(as.character(V12),"," ))),as.integer(unlist(strsplit(as.character(V11),"," ))))]
  rep.ref.dt=ref.dt[rep(1:nrow(ref.dt),ref[,10]),] # replicate rows occurs as many as its exon number
 
  rep.ref.dt$V3=rep.ref.dt$V2+b.start.size[,1]+b.start.size[,2] # calculate exon start and ends
  rep.ref.dt$V2=rep.ref.dt$V2+b.start.size[,1]+1
 
  return(rep.ref.dt[,1:6,with=F])
} 
 
res=bed12.to.exons(bed.file) 
system.time(bed12.to.exons(bed.file))
 
 
# a line from a bed file, just to show how it looks like
>read.table(bed.file,header=F)[1,]
      V1        V2        V3              V4 V5 V6        V7        V8 V9 V10      V11
1 chr1 223340009 223340009 ENST00000328556  0  + 223340009 223351568  0   2 105,126,
         V12
1 0,11433,
Created by Pretty R at inside-R.org

1 comment:

  1. James Long´s segue package has a function called emrlapply that may help speed up apply() functionality across large numbers of rows. See http://code.google.com/p/segue/ and http://groups.google.com/group/segue-r/

    -M.

    ReplyDelete