Recipes, scripts and genomics: Comparing the execution time between foverlaps and findOverlaps [data.table vs GenomicRanges]

Monday, April 13, 2015

Comparing the execution time between foverlaps and findOverlaps [data.table vs GenomicRanges]

Both of these functions find overlaps between genomic intervals. The findOverlaps function is from the Bioconductor package GenomicRanges (or IRanges if you don't need to compare intervals with an associated chromosome and strand). foverlaps is from the data.table package and is inspired by findOvelaps.

In genomics, we often have one large data set X with small interval ranges (usually sequenced reads) and another smaller data set Y with larger interval spans (usually exons, introns etc.). Generally, we are tasked with finding which intervals in X overlap with which intervals in Y.

In the foverlaps function Y has to be indexed using the setkey function (we don't have to do it on X). The key is intended to speed-up finding overlaps.

Which one is faster?

To check this we used the benchmark function from the rbenchmark package. It's a simple wrapper of the system.time function.

The code below plots the execution time of both functions for increasing numbers of rows of data set X.

Interestingly, foverlaps is the fastest way to solve the problem of finding overlaps, but only when the large data set has less than 200k rows.

We also plotted situation when we exchanged the place of X and Y in arguments of both functions. In this case you can see that almost from the beginning foverlaps is much slower than findOverlaps.

Information about my R session:

> sessionInfo()

R version 3.1.3 (2015-03-09)

Platform: x86_64-apple-darwin13.4.0 (64-bit)

Running under: OS X 10.9.4 (Mavericks)

locale:

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils     datasets 

[8] methods   base     

other attached packages:

[1] data.table_1.9.4     rbenchmark_1.0.0     GenomicRanges_1.18.4

[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        S4Vectors_0.4.0     

[7] BiocGenerics_0.12.1 

loaded via a namespace (and not attached):

[1] chron_2.3-45   plyr_1.8.1     Rcpp_0.11.5    reshape2_1.4.1 stringr_0.6.2 

[6] tools_3.1.3    XVector_0.6.0

11 comments:

AnonymousApril 13, 2015 at 1:02 PM
Hi Katarzyna, nice post! Thanks for the comparison. I've a couple of corrections and suggestions:

1. I'd suggest re-running with the first two lines changed to:

start <- seq(1L, 20e8L, 1000L)
end <- start + 3000L

so that intervals are integers. Your ranges are currently numeric, and allocation/joins on numeric ranges would be slower, of course, compared to integers. `foverlaps()` is designed with all numeric types, for e.g., double, int64, Date, POSIXct etc.. in mind, and therefore intervals are not automatically coerced.

2. You missed a "nomatch = 0L" argument to 'foverlaps', since 'foverlaps' by default also returns no matches, whereas 'findOverlaps' doesn't.

3. It'd be nice to benchmark on a more realistic dataset (under more realistic sizes as well). If you're interested, here's one of the same data that I've tested on: https://dl.dropboxusercontent.com/u/3851628/reads_50_sample.zip

On just "Chr1", I get 0.8 vs 3.9 seconds and 3.9 vs 17 seconds on entire dataset (foverlaps vs findOverlaps).

Note that on just one chromosome, the comparison is really foverlaps vs IRanges (more or less).

4. It'd also be nice to compare when `query` isn't sorted. Testing on the sample data linked above by randomly reordering query resulted in 10s vs 50s (foverlaps vs findOverlaps).

There are several improvements I'd like to get done in `foverlaps()`. I'll write a gist (or a vignette) and update with a link here, when I manage.

Arun,
Co-developer, data.table
@arun_sriniv

ReplyDelete
Replies
UnknownApril 14, 2015 at 9:40 AM
Hi Arun,
Thanks for your reply! I've changed code according to your suggestions - changed intervals from numeric to integer and added "nomatch = 0L" argument. Thanks for the link. So why is such difference between our results? I've checked that in your dataset there is more overlapped intervals than in this generated by me. Is it because of width of intervals?
ReplyDelete
Replies
UnknownApril 21, 2015 at 8:32 AM
This comment has been removed by the author.
ReplyDelete
Replies
Humaun KabirJune 29, 2015 at 10:37 PM
Lots of Good information in your post, I favorited your blog post so I can visit again in the future, Thanks.
genomic
ReplyDelete
Replies

Add comment

Pages

Monday, April 13, 2015

Comparing the execution time between foverlaps and findOverlaps [data.table vs GenomicRanges]

11 comments: