Tuesday, February 8, 2011

sqldf and grouping rows in R

In R, you can treat tables (or data.frames as they are called in R) as SQL tables. That means you can query them as you would query a database with SQL commands. This is particularly useful 1) if you know SQL, hahah:)  2) if you have large tables with millions of rows. In R, querying a database will be much faster than iterating through the rows of a million row table.

Say you have a table of exon locations and RPKM value for each exon and you want to get the total RPKM value for each transcript:


>head(rnaEx)
  ex_rank                ex_name tx_chr  tx_start    tx_end     score len   strand
       1 heart_k25:1002070u:2.9   chrX 101864664 101864760  1314.937  65     +
       2 heart_k25:1002070u:2.9   chrX 101864664 101864760     0.000   5     +
       1 heart_k25:1002622u:3.0   chrX  96534438  96534487  1055.568  50     -



# load the library that can treat tables as SQL tables
>require(sqldf)

# query the table with SQL command: group by transcripts and sum up the scores
>int.res=sqldf("select ex_name,tx_chr,tx_start,tx_
end,strand,SUM(score) AS sum_score,SUM(len) AS len from rnaEx GROUP BY ex_name,tx_chr,tx_start,tx_end")

>head(int.res)
                 ex_name tx_chr  tx_start    tx_end strand sum_score len
  heart_k25:100000u:2.5   chrX  73370385  73370481      +  2448.919  97
 heart_k25:1000027u:5.5   chr1  93550994  93551053      -  2352.406  60
 heart_k25:1000076u:2.5   chr5  15764153  15764204      -  1140.013  52
  heart_k25:1000078:2.6  chr11 116415816 116415975      +  3787.981 160


Check here for more info on sqldf:

Similar functionality also exists in R through another package called plyr. Although, I feel like sqldf will be faster for large data.frames, I have no evidence on that at the moment. plyr is described here and here, and check below for a quick intro:

No comments:

Post a Comment