In R, you can treat tables (or data.frames, as they are called in R) as SQL tables. That means you can query them as you would query a database with SQL commands. This is particularly useful 1) if you already know SQL, and 2) if you have large tables with millions of rows: querying such a table with SQL will be much faster than iterating over its rows yourself.
Say you have a table of exon locations with an RPKM value for each exon, and you want to get the total RPKM value for each transcript:
>head(rnaEx)
  ex_rank                ex_name tx_chr  tx_start    tx_end    score len strand
1       1 heart_k25:1002070u:2.9   chrX 101864664 101864760 1314.937  65      +
2       2 heart_k25:1002070u:2.9   chrX 101864664 101864760    0.000   5      +
3       1 heart_k25:1002622u:3.0   chrX  96534438  96534487 1055.568  50      -
# load the library that can treat tables as SQL tables
>require(sqldf)
# query the table with an SQL command: group by transcript and sum up the scores
>int.res=sqldf("select ex_name,tx_chr,tx_start,tx_end,strand,sum(score) as sum_score,sum(len) as len from rnaEx group by ex_name")
>head(int.res)
                 ex_name tx_chr  tx_start    tx_end strand sum_score len
1  heart_k25:100000u:2.5   chrX  73370385  73370481      +  2448.919  97
2 heart_k25:1000027u:5.5   chr1  93550994  93551053      -  2352.406  60
3 heart_k25:1000076u:2.5   chr5  15764153  15764204      -  1140.013  52
4  heart_k25:1000078:2.6  chr11 116415816 116415975      +  3787.981 160
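If you want to check the speed claim above on your own data, you can time the SQL approach against a base R alternative. This is just a rough sketch, and tapply is only one of several non-SQL ways to do the grouping:
# rough timing comparison: SQL grouping vs. base R grouping over the same column
>system.time(sqldf("select ex_name,sum(score) as sum_score from rnaEx group by ex_name"))
>system.time(tapply(rnaEx$score, rnaEx$ex_name, sum))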
Check the sqldf page on CRAN for more info on sqldf.
Similar functionality also exists in R through another package called plyr. I feel like sqldf will be faster for large data.frames, although I have no evidence for that at the moment. plyr is described here and here; check below for a quick intro:
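Here is that quick intro, as a minimal sketch assuming the same rnaEx data.frame as above (int.res2 is just an illustrative name, not something from the plyr docs). ddply() splits the data.frame by the grouping columns and summarise() computes the sums within each group:
# the same transcript-level summary with plyr: split rnaEx by transcript
# (plus its constant location columns), then sum score and exon length
>require(plyr)
>int.res2=ddply(rnaEx, .(ex_name,tx_chr,tx_start,tx_end,strand), summarise, sum_score=sum(score), len=sum(len))
>head(int.res2)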