Comments on Recipes, scripts and genomics: Applying functions on groups: sqldf, plyr, doBy, aggregate or data.table?

Ken (2014-02-26):
dplyr is a HUGE improvement over plyr

sqldf - 0.42
doby - 4.06
aggregate - 13.30
plyr - 134.14
dataT - 0.27
dplyr - 0.24

https://github.com/hadley/dplyr

Vladislav Petyuk (2014-02-19):
I bet adding the keys to the data.table should speed it up.

sanjeeb (2011-06-28):
I want to run a SQL query in a loop, e.g.

sqldf("select cust_id from emp")

In place of cust_id I want to use another variable from my dataset. Please, can anyone tell me how to do that?

Thanks

Anonymous (2011-03-24):
What about tapply?
d.tapply <- function(d) {
  a <- tapply(d$x, interaction(d$grp1, d$grp2), mean)
  b <- tapply(d$y, interaction(d$grp1, d$grp2), mean)
  index <- !is.na(a)
  a <- a[index]
  b <- b[index]
  return(cbind(a, b))
}

system.time(d.tapply(d))

altuna (2011-03-21):
Some good points there! I encourage anyone who is interested to do a better-planned analysis. Unfortunately, I don't have any more time to dedicate to this.

Gabor (2011-03-21):
(1) The calculation of DT from d was left out of the timing; the timing for data.table is therefore larger than shown. (2) The statement that using sum is more unbiased seems strange. Both summation and averaging are common operations that represent real usage cases. This is data dredging, where you use different variations of the benchmark until you find one where you like the result. (3) The fact that minor changes in the benchmark can cause the order to change shows that the conclusions of one particular benchmark cannot be relied upon. (4) Other criteria that might be used but were not mentioned would be ease of specification and whether the method can handle intermediate results larger than memory.

altuna (2011-03-18):
Thank you for your suggestions, Larry and ts.
Indeed, when I use .Internal(), data.table is the fastest!

John, I don't have a lot of experience with plyr. Right now I'm using sqldf. I learned about data.table yesterday; it seems like I will be using that too.

tshort (2011-03-18):
In R, mean has high overhead. You can replace mean(x) with .Internal(mean(x)) to see a significant improvement in many of the approaches. The speedup in data.table is the most noticeable. My timings are:

> rsqldf
   user  system elapsed
   1.24    0.75    2.17

> rdataT   # original
   user  system elapsed
   6.42    0.02    6.50

> rdataT   # using .Internal(mean(x))
   user  system elapsed
   0.15    0.00    0.18

See also: http://stackoverflow.com/questions/4322219/

JohnnyPhysix (2011-03-18):
Thank you for the short study. I've been a plyr user, but looking at this I am definitely going to give the other packages a try. Do any of the other packages have shortcomings compared to plyr?

Larry (2011-03-18):
Not a huge time saver, but in the aggregate function the FUN argument can be just "mean" instead of "function(x) mean(x)". It seemed to save about a second on my system.

altuna (2011-03-17):
Thanks, I followed your suggestion and added data.table as well.

Anonymous (2011-03-17):
Perhaps you could give data.table a try: http://datatable.r-forge.r-project.org/
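
On sanjeeb's question about substituting a column name inside an sqldf query: one common approach is to build the query string before passing it to sqldf(). A minimal sketch, assuming the sqldf package is installed; the emp table and its columns here are made up for illustration:

```r
# Sketch: parameterizing the selected column in an sqldf query.
# Assumes the sqldf package is installed; emp is a made-up example table.
library(sqldf)

emp <- data.frame(cust_id = 1:3, dept_id = 4:6)

for (col in c("cust_id", "dept_id")) {
  # Build the query text first, then run it
  query <- sprintf("select %s from emp", col)
  print(sqldf(query))
}
```

sqldf's documentation also describes gsubfn's fn$ prefix (e.g. fn$sqldf("select $col from emp")), which interpolates R variables directly into the query string.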
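
Vladislav's point about keys can be sketched as follows. This is an illustration, not the post's benchmark code: the grp1/grp2/x/y column names follow the tapply comment above, and the data are randomly generated.

```r
# Sketch: setting a key on a data.table before grouped aggregation.
# Assumes the data.table package is installed; data are simulated.
library(data.table)

set.seed(1)
d <- data.frame(grp1 = sample(3, 1e5, replace = TRUE),
                grp2 = sample(3, 1e5, replace = TRUE),
                x = rnorm(1e5), y = rnorm(1e5))

DT <- as.data.table(d)
setkey(DT, grp1, grp2)  # sorts DT and marks grp1, grp2 as its key

# Grouped means of x and y; grouping by the key columns can
# exploit the sorted order
res <- DT[, .(mx = mean(x), my = mean(y)), by = .(grp1, grp2)]
```

Per Gabor's point (1), a fair benchmark would also include the as.data.table() conversion and the setkey() call inside the timed code.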