The code snippets below show how to read RRBS bed file produced by ENCODE. But, let's first download the files.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/wgEncodeHaibMethylRrbsA549Dm002p7dHaibSitesRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/wgEncodeHaibMethylRrbsAg04449UwstamgrowprotSitesRep1.bed9.gz
Unfortunately, methylKit currently can not read them directly because the track definition line
causes a problem. It should be deleted from each bed file.
Ideally, methylKit should be able to skip over lines (this issue should be fixed in later versions)For now, we have to use some unix tools to remove the first lines from the bed files. You run the code below in your terminal. This set of commands will delete the first line from every *.gz file in the directory so be careful.
for files in *.gz
do
gzip -dc "$files" | tail +2 | gzip -c > "$files".tmp
if [ "$?" -eq 0 ]; then
mv "$files".tmp "$files"
fi
done
Now we can read the files using methylKit. The pipeline argument defines which columns in the file
are corresponding to chr,start,end,strand, percent methylation and coverage:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
library(methylKit) | |
obj = read("wgEncodeHaibMethylRrbsA549Dm002p7dHaibSitesRep1.bed.gz", | |
sample.id = "test", assembly = "hg19", header = FALSE, | |
context = "CpG", resolution = "base", | |
pipeline = list(fraction = FALSE, chr.col = 1, start.col = 3, | |
end.col = 3,coverage.col = 5, strand.col = 6, | |
freqC.col = 11)) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
file.list = list("wgEncodeHaibMethylRrbsA549Dm002p7dHaibSitesRep1.bed.gz", | |
"wgEncodeHaibMethylRrbsAg04449UwstamgrowprotSitesRep1.bed9.gz") | |
obj = read(file.list, sample.id = list("test", "control"), treatment = c(1,0), | |
assembly = "hg19", header = FALSE, context = "CpG", resolution = "base", | |
pipeline = list(fraction = FALSE, chr.col = 1, start.col = 3, end.col = 3, | |
coverage.col = 5, strand.col = 6, freqC.col = 11)) | |
obj | |
## methylRawList object with 2 methylRaw objects | |
## | |
## methylRaw object with 1464031 rows | |
## -------------- | |
## id chr start end strand coverage numCs numTs | |
## 1 chr1.100002990 chr1 100002990 100002990 + 3 1 2 | |
## 2 chr1.100003154 chr1 100003154 100003154 - 19 1 18 | |
## 3 chr1.1000171 chr1 1000171 1000171 + 26 7 19 | |
## 4 chr1.1000191 chr1 1000191 1000191 + 26 14 12 | |
## 5 chr1.1000192 chr1 1000192 1000192 - 22 9 13 | |
## 6 chr1.1000199 chr1 1000199 1000199 + 26 15 11 | |
## -------------- | |
## sample.id: test | |
## assembly: hg19 | |
## context: CpG | |
## resolution: base | |
## | |
## methylRaw object with 1233588 rows | |
## -------------- | |
## id chr start end strand coverage numCs numTs | |
## 1 chr1.1000171 chr1 1000171 1000171 + 22 1 21 | |
## 2 chr1.1000191 chr1 1000191 1000191 + 22 4 18 | |
## 3 chr1.1000192 chr1 1000192 1000192 - 12 4 8 | |
## 4 chr1.1000199 chr1 1000199 1000199 + 22 11 11 | |
## 5 chr1.1000200 chr1 1000200 1000200 - 12 4 8 | |
## 6 chr1.1000207 chr1 1000207 1000207 - 12 4 8 | |
## -------------- | |
## sample.id: control | |
## assembly: hg19 | |
## context: CpG | |
## resolution: base | |
## | |
## treament: 1 0 |
Since we have read the files and now they are methylKit objects, we can use all of the methylKit functionality on these objects. For example, the code below plots the distribution of methylation % for covered bases.
getMethylationStats(obj[[1]], plot = TRUE)
You can check the methylKit vignette
and the website for the rest of the functionality and details.
I tried the above. But I get the following error message
ReplyDelete> library(methylKit)
> obj = read("wgEncodeHaibMethylRrbsA549Dm002p7dHaibSitesRep1.bed.gz",
+ sample.id = "test",assembly = "hg19",header = FALSE,
+ context = "CpG",resolution ="base",
+ pipeline = list(fraction = FALSE,chr.col = 1, start.col = 3, end.col = 3,
+ coverage.col = 5,strand.col = 6, freqC.col = 11))
Error in .local(location, sample.id, assembly, pipeline, header, context, :
unknown 'pipeline' argument, supported alignment pipelines: amp
In addition: Warning message:
In if (pipeline %in% c("amp", "bismark")) { :
the condition has length > 1 and only the first element will be used
>
Try the latest version of the package with the appropriate version of R.
DeleteThanks for the post. Maybe it's better to use coverage.col = 10, which is the reads count or coverage for given site. The 5th column is the score of the track, which is equal to coverage when coverage is not greater than 1000.
ReplyDeleteGreat post! Thank you so much for sharing..
ReplyDeleteFor those who want to learn R Programming, here is a great new course on youtube for beginners and Data Science aspirants. The content is great and the videos are short and crisp. New ones are getting added, so I suggest to subscribe.
https://www.youtube.com/watch?v=BGWVASxyow8&list=PLFAYD0dt5xCzTQHDhMPZwBoaAXWeVhZzg&index=19
Great idea!
ReplyDeleteHi my friend, Can you show me how to insert R or Perl Script into the blogs. Thanks
ReplyDeleteHi my friend, Can you show me how to insert R or Perl Script into the blogs. Thanks
ReplyDelete