By using data.table I could get locations of exons for all human Ensembl transcripts in 5-6 seconds, while using apply() and strsplit() on a regular data.frame takes about 30 seconds ( see the EDIT below, this is probably because I used apply in an inefficient way, it may not be needed at all ). The function returns a data.table object which then can be transformed into IRanges or GenomicRanges objects easily. Doing the string operations and iteration in C or C++ may decrease this time even further.
EDIT: I'm not sure why this is particularly faster than a data.frame, I guess I used apply in an idiotic way on the BED data.frame. I applied strsplit row-by-row on 11th column of each row, rather than using strsplit directly on 11th column of the data.frame. Anyway, it was a good exercise but a bit useless :)
bed12.to.exons<-function(bed.file,skip=0) { require(data.table) colClasses=c("factor","integer","integer","factor","integer","factor","integer","integer","integer","integer","factor","factor") ref =read.table(bed.file,header=F,skip=skip,colClasses = colClasses) # give colClasses for fast read-in ref.dt=data.table(ref) # apply strsplit on columns 11 and 12 to extract exon start positions and exon sizes b.start.size=ref.dt[, cbind(as.integer(unlist(strsplit(as.character(V12),"," ))),as.integer(unlist(strsplit(as.character(V11),"," ))))] rep.ref.dt=ref.dt[rep(1:nrow(ref.dt),ref[,10]),] # replicate rows occurs as many as its exon number rep.ref.dt$V3=rep.ref.dt$V2+b.start.size[,1]+b.start.size[,2] # calculate exon start and ends rep.ref.dt$V2=rep.ref.dt$V2+b.start.size[,1]+1 return(rep.ref.dt[,1:6,with=F]) } res=bed12.to.exons(bed.file) system.time(bed12.to.exons(bed.file)) # a line from a bed file, just to show how it looks like >read.table(bed.file,header=F)[1,] V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 1 chr1 223340009 223340009 ENST00000328556 0 + 223340009 223351568 0 2 105,126, V12 1 0,11433,
James Long´s segue package has a function called emrlapply that may help speed up apply() functionality across large numbers of rows. See http://code.google.com/p/segue/ and http://groups.google.com/group/segue-r/
ReplyDelete-M.