Comments on Recipes, scripts and genomics: Applying functions on groups: sqldf, plyr, doBy, aggregate or data.table?

Ken (2014-02-26):
dplyr is a HUGE improvement over plyr

sqldf - 0.42
doby - 4.06
aggregate - 13.30
plyr - 134.14
dataT - 0.27
dplyr - 0.24

https://github.com/hadley/dplyr

Vladislav Petyuk (2014-02-19):
I bet adding the keys to the data.table should speed it up.

sanjeeb (2011-06-28):
I want to run a SQL query in a loop, e.g.

sqldf("select cust_id from emp")

In place of cust_id I want to use another variable from my dataset. Please, can anyone tell me how to do that?

Thanks

Anonymous (2011-03-24):
What about tapply?
d.tapply <- function(d) {
  a <- tapply(d$x, interaction(d$grp1, d$grp2), mean)
  b <- tapply(d$y, interaction(d$grp1, d$grp2), mean)
  index <- !is.na(a)
  a <- a[index]
  b <- b[index]
  return(cbind(a, b))
}

system.time(d.tapply(d))

altuna (2011-03-21):
Some good points there! I encourage anyone who is interested to do a better-planned analysis. Unfortunately, I don't have any more time to dedicate to this.

Gabor (2011-03-21):
(1) The calculation of DT from d was left out of the timing; the timing for data.table is therefore larger than shown. (2) The statement that using sum is more unbiased seems strange. Both summation and averaging are common operations that represent real usage cases. This is data dredging, where you use different variations of the benchmark until you find one where you like the result. (3) The fact that minor changes in the benchmark can cause the order to change shows that the conclusions of one particular benchmark cannot be relied upon. (4) Other criteria that might be used but were not mentioned would be ease of specification and whether the method can handle intermediate results larger than memory.

altuna (2011-03-18):
Thank you for your suggestions, Larry and ts.
Indeed, when I use .Internal(), data.table is the fastest!

John, I don't have a lot of experience with plyr. Right now I'm using sqldf. I learned about data.table yesterday; it seems like I will be using that too.

tshort (2011-03-18):
In R, mean has high overhead. You can replace mean(x) with .Internal(mean(x)) to see a significant improvement in many of the approaches. The speedup in data.table is the most noticeable. My timings are:

> rsqldf
   user  system elapsed
   1.24    0.75    2.17

> rdataT   # original
   user  system elapsed
   6.42    0.02    6.50

> rdataT   # using .Internal(mean(x))
   user  system elapsed
   0.15    0.00    0.18

See also: http://stackoverflow.com/questions/4322219/

JohnnyPhysix (2011-03-18):
Thank you for the short study. I've been a plyr user, but looking at this I am definitely going to give the other packages a try. Do any of the other packages have shortcomings compared to plyr?

Larry (2011-03-18):
Not a huge time saver, but in the aggregate function the FUN argument can be just "mean" instead of "function(x) mean(x)". It seemed to save about a second on my system.

altuna (2011-03-17):
Thanks, I followed your suggestion and added data.table as well.

Anonymous (2011-03-17):
Perhaps you could give data.table a try: http://datatable.r-forge.r-project.org/
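
On sanjeeb's question about substituting a column name inside an sqldf query: one common approach is to build the query string before passing it to sqldf(). A minimal sketch, assuming the sqldf package is installed; the emp table and its columns here are made up for illustration:

```r
# Sketch: parameterizing the selected column in an sqldf query.
# Assumes the sqldf package is installed; emp is a made-up example table.
library(sqldf)

emp <- data.frame(cust_id = 1:3, dept_id = 4:6)

for (col in c("cust_id", "dept_id")) {
  # Build the query text first, then run it
  query <- sprintf("select %s from emp", col)
  print(sqldf(query))
}
```

sqldf's documentation also describes gsubfn's fn$ prefix (e.g. fn$sqldf("select $col from emp")), which interpolates R variables directly into the query string.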
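
Vladislav's point about keys can be sketched as follows. This is an illustration, not the post's benchmark code: the grp1/grp2/x/y column names follow the tapply comment above, and the data are randomly generated.

```r
# Sketch: setting a key on a data.table before grouped aggregation.
# Assumes the data.table package is installed; data are simulated.
library(data.table)

set.seed(1)
d <- data.frame(grp1 = sample(3, 1e5, replace = TRUE),
                grp2 = sample(3, 1e5, replace = TRUE),
                x = rnorm(1e5), y = rnorm(1e5))

DT <- as.data.table(d)
setkey(DT, grp1, grp2)  # sorts DT and marks grp1, grp2 as its key

# Grouped means of x and y; grouping by the key columns can
# exploit the sorted order
res <- DT[, .(mx = mean(x), my = mean(y)), by = .(grp1, grp2)]
```

Per Gabor's point (1), a fair benchmark would also include the as.data.table() conversion and the setkey() call inside the timed code.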