Recipes, scripts and genomics: Applying functions on groups: sqldf, plyr, doBy, aggregate or data.table ?

Thursday, March 17, 2011

Applying functions on groups: sqldf, plyr, doBy, aggregate or data.table ?

Which one of the sqldf, plyr, doBy and aggregate functions/packages would be faster for applying functions on groups of rows? I was wondering about this earlier in this post. It seems sqldf would be the fastest according to a post in manipulatr mail list.

Well, here is the ranking from the fastest to the slowest:

(check the link for the post above for details on comparison)

The downside of sqldf is that you will be able to use only sql functions on groups, but in the rest of the other options you can apply R functions on groups.

EDIT: I repeated the experiment on my own computer, this time I included data.table package as well. The new experiment shows that sqldf is still the fastest but data.table ranks second now. But I have been warned that the mean() function has large overhead and in fact if I use .Internal(mean(x)) instead of mean(x) for the group operations, data.table is the fastest!!! Scroll down to see another comparison of the grouping functions, this time using sum() function which doesn't seem to have a large overhead.

The code is below. I mostly copy-pasted from the post at manipulatr, except the data.table part and the plot.

EDIT: More unbiased way to measure this is to use sum() function on groups. When I do that, data.table comes first!

11 comments:

AnonymousMarch 17, 2011 at 3:04 PM
Perhaps you could give data.table a try.
ReplyDelete
Replies
altunaMarch 17, 2011 at 6:20 PM
Thanks, I followed your suggestion and added data.table as well.
ReplyDelete
Replies
LarryMarch 18, 2011 at 6:13 AM
Not a huge time saver but in the aggregate function the FUN argument can be just "mean" instead of "function(x) mean(x)". It seemed to save about a second on my system.
ReplyDelete
Replies
tshortMarch 18, 2011 at 8:43 AM
In R, mean has high overhead. You can replace mean(x) with .Internal(mean(x)) to see significant improvement in many of the approaches. The speedup in data.table is most noticeable. My timings are:

> rsqldf
user system elapsed
1.24 0.75 2.17

> rdataT # original
user system elapsed
6.42 0.02 6.50

> rdataT # using .Internal(mean(x))
user system elapsed
0.15 0.00 0.18

See also here:

http://stackoverflow.com/questions/4322219/
ReplyDelete
Replies
altunaMarch 18, 2011 at 1:02 PM
Thank you for your suggestions Larry and ts. Indeed when I use .Internal() data.table is the fastest!!

John, I don't have a lot of experience with plyr. Right now, I'm using sqldf. I learned about data.table yesterday. Seems like I will be using that too.
ReplyDelete
Replies
GaborMarch 21, 2011 at 11:09 AM
(1) The calculation of DT from d was left out of the timing. The timing for data table is therefore larger than shown. (2) The statement that using sum is more unbiased seems strange. Both summation and averaging are common operations that represent real usage cases. This is data dredging where you use different variations of the benchmark until you find one where you like the result. (3) The fact that minor changes in the benchmark can cause the order to change shows that the conclusions of one particular benchmark cannot be relied upon. (4) Other criteria that might be used but were not mentioned would be ease of specification and whether the method can handle intermediate results larger than memory.
ReplyDelete
Replies
altunaMarch 21, 2011 at 11:36 AM
Some good points there! I encourage anyone who is interested to do a better planned analysis. Unfortunately, I don't have any more time to dedicate to this.
ReplyDelete
Replies
UnknownMarch 24, 2011 at 10:57 AM
What about tapply

d.tapply<-function(d){
a<-tapply(d$x,interaction(d$grp1,d$grp2), mean)
b<-tapply(d$y,interaction(d$grp1,d$grp2), mean)
index<-(!is.na(a))
a<-a[index]
b<-b[index]
return(cbind(a,b))
}

system.time(d.tapply(d))
ReplyDelete
Replies
sanjeebJune 28, 2011 at 4:30 AM
I want to run a sql query in a loop
e.g
sqldf("select cust_id from emp")

inplace of cust_id I want to take another variable from my dataset. plz anyone tell me how to do that

thanks
ReplyDelete
Replies
Vladislav PetyukFebruary 19, 2014 at 4:52 PM
I bet adding the keys to the data.table should speed it up.
ReplyDelete
Replies
KenFebruary 26, 2014 at 1:58 PM
dplyr is a HUGE improvement over plyr

sqldf - 0.42
doby - 4.06
aggregate - 13.30
plyr - 134.14
dataT - 0.27
dplyr - 0.24

https://github.com/hadley/dplyr
ReplyDelete
Replies

Add comment

Pages

Thursday, March 17, 2011

Applying functions on groups: sqldf, plyr, doBy, aggregate or data.table ?

11 comments: