First, you need to separate the BSMAP methylation ratio file by methylation context. Cs in the CpG, CHH and CHG contexts should go into different files. This can be done with a Perl or awk one-liner. If you want to analyze Cs from all contexts, you don't need to split the file. For example, to get all Cs in the CpG context you can do something like the following:
awk '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt
Here is what a methylation ratio file from BSMAP looks like (this file includes only Cs in the CpG context):
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 + CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 + ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 + GTCGT 0.000 56 0 0.000 0.064
chr1 3121605 + CTCGG 0.000 56 0 0.000 0.064
chr1 3121606 + TGCGC 0.000 56 0 0.000 0.064
chr1 3121607 + GGCGC 0.000 56 0 0.000 0.064
chr1 3121611 + CTCGA 0.000 56 0 0.000 0.064
chr1 3121614 + TACGC 0.000 56 0 0.000 0.064
chr1 3121631 + CTCGT 0.000 56 0 0.000 0.064
You can read the methylation ratio file by using the "pipeline" argument of the read() function. You need to provide a list of column numbers corresponding to chr, start, end, strand, coverage and methylation ratio. In fact, you can read any generic methylation ratio or percentage file this way, as long as the file contains the location information (chr, start, end and strand), the coverage, and the methylation percentage or ratio.
Thanks to Maxime Caron for sharing the BSMAP methylation ratio file.
Reference for methylKit:
Altuna Akalin, Matthias Kormaksson, Sheng Li, Francine E. Garrett-Bakelman, Maria E. Figueroa, Ari Melnick, Christopher E. Mason. (2012). "methylKit: A comprehensive R package for the analysis of genome-wide DNA methylation profiles." Genome Biology, 13:R87.
PS: For newer releases of BSMAP (~v2.73), you may need to coerce the CT counts to integers. The BSMAP ratio extractor may output effective CT counts as floats, but they should be integers before they are read in by methylKit. methylKit (>= v0.5.6) will automatically coerce those values to the nearest integer. However, you can also use awk to convert the float values to integers. The example below converts the 6th column of the file to an integer and prints the first six columns:
awk '{OFS="\t";print $1,$2,$3,$4,$5,int($6);}' sample.BSMAP.txt
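Note that awk's int() truncates toward zero, so 3.5 becomes 3 rather than 4. If you would rather round to the nearest integer and keep the header line and the remaining columns, a variant along the following lines may help. This is only a sketch: the input rows are made-up toy values, the output file name is arbitrary, and it assumes a tab-delimited file with a single header line and non-negative counts in column 6.

```shell
# Toy input (made-up values) standing in for a BSMAP ratio file with a header.
printf 'chr\tpos\tstrand\tcontext\tratio\ttotal_C\tmethy_C\tCI_lower\tCI_upper\n' > sample.BSMAP.txt
printf 'chr1\t100\t+\tAACGT\t0.500\t3.50\t1.75\t0.100\t0.900\n' >> sample.BSMAP.txt
printf 'chr1\t105\t-\tACGAA\t0.250\t8.20\t2.05\t0.050\t0.600\n' >> sample.BSMAP.txt

# Pass the header through unchanged; on data lines, round column 6
# to the nearest integer (valid for non-negative values) and keep all columns.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { print; next }
     { $6 = int($6 + 0.5); print }' sample.BSMAP.txt > sample.BSMAP.int.txt
```

With the toy rows above, column 6 becomes 4 and 8, whereas plain int() would give 3 and 8.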
Dear Altuna,
Referring to this sentence:
"you need to separate BSMAP methylation ratio file based on methylation context. Cs in CpG, CHH and CHG context should be separated into different files. This can be achieved with a perl or awk one-liner."
Could you please write the Perl or awk one liner example, so that we can use it for the task you refer to?
With best wishes
You can use something like this to get Cs in the CpG context
awk '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt
for CHH and CHG context you will probably need to write a slightly more complex regular expression.
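As a rough sketch of what those expressions could look like: the patterns below assume that the 5-bp context is always reported on the plus strand with the covered base in the middle (position 3), so a minus-strand C shows up as a G at position 3 and its downstream bases are the complements of positions 2 and 1. H (A, C or T) on the minus strand therefore appears as A, G or T on the plus strand. The input rows and file names are made up for illustration, and contexts containing N are simply not matched.

```shell
# Toy input (made-up rows): chr, pos, strand, 5-bp plus-strand context, ratio.
cat > toy_ratio.txt <<'EOF'
chr1 10 + AACGT 0.9
chr1 20 + AACAG 0.1
chr1 30 + AACTT 0.0
chr1 40 - ACGAA 0.8
chr1 50 - CTGAA 0.2
chr1 60 - ATGAA 0.0
EOF

# CHG: plus strand  -> C at pos 3, H at pos 4, G at pos 5
#      minus strand -> C at pos 1, [AGT] at pos 2, G at pos 3
awk '($3=="+" && $4~/^..C[ACT]G/) || ($3=="-" && $4~/^C[AGT]G/)' toy_ratio.txt > CHG.txt

# CHH: plus strand  -> C at pos 3, H at pos 4 and pos 5
#      minus strand -> [AGT] at pos 1 and pos 2, G at pos 3
awk '($3=="+" && $4~/^..C[ACT][ACT]/) || ($3=="-" && $4~/^[AGT][AGT]G/)' toy_ratio.txt > CHH.txt
```

On the toy input, CHG.txt ends up with the rows at positions 20 and 50, and CHH.txt with the rows at positions 30 and 60. The patterns use plain character classes rather than interval expressions such as `.{2}`, which some older awk implementations do not support.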
There's another problem I've come across in this workflow. Once I have the BSMAP 2.7 results in bsp format, I get the following error while running methratio.py:
Traceback (most recent call last):
File "methratio.py", line 81, in
meth[cr] = array.array('I', [0]) * len(ref[cr])
MemoryError
To my understanding (I don't know any Python yet) it's due to a Python module but I have no clue which module is involved.
Has anybody else experienced this error using methratio.py?
Thanks in advance for your attention
Hi Jose Luis,
I think you should take this to the BSMAP developers or their e-mail group; they should be able to help you with this.
cheers
I think it should be awk '($3=="-" && $4~/^.{2}GC/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt ( for minus strand it should be ($3=="-" && $4~/^.{2}GC/ ))
I think not. The BSMAP readme.txt says that the 5-bp sequence is always given on the plus strand, irrespective of the strand of the covered CpG. So a CpG on the minus strand will appear as ".{1}CG". For example, if strand="-", the following string denotes a CpG on the minus strand: ACGAA. Read on the minus strand, the same sequence is TTCGT.
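The strand argument above can be checked quickly in the shell by reverse-complementing the plus-strand context. This is just an illustrative sketch; `revcomp` is a hypothetical helper name, and it relies on the common `rev` and `tr` utilities being available.

```shell
# Reverse-complement a DNA string (hypothetical helper, uppercase ACGT only).
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

# Plus-strand context from the example above; prints TTCGT,
# which has the minus-strand CpG at positions 3-4.
revcomp ACGAA
```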
Hi, thanks for the info on reading in the BSMAP files. It was really helpful. I was wondering if you have any advice for generating a methylRawList from individual files that were read in this way.
The example given in the read{methylKit} help file only shows how to generate a methylRaw object from 'generic' files, I can't figure out how to rework the example to generate a methylRawList from these individual objects.
Thanks
You should be able to give a list of file locations, keep the "pipeline" argument as shown above, give "sample.id" as a list, and also use the "treatment" argument. That will read a list of files and produce a methylRawList.
Something like the following (file.list is a list of file locations):
myobj=read( file.list,pipeline=list(fraction=TRUE,chr.col=1,start.col=2,end.col=2,
coverage.col=6,strand.col=3,freqC.col=5 ),
sample.id=list("test1","test2","ctrl1","ctrl2"),assembly="hg18",treatment=c(1,1,0,0))
Wonderful! Thank you