Recipes, scripts and genomics: How to read BSMAP methylation ratio files into R via methylKit

Friday, October 5, 2012

How to read BSMAP methylation ratio files into R via methylKit

BSMAP is an aligner for bisulfite sequencing reads. It outputs aligned reads as well as per-base methylation ratios (via the methratio.py script). The methylation ratios can be read into R with the methylKit package, and a regular methylKit analysis can then be performed on the BSMAP data.

First, you need to separate the BSMAP methylation ratio file by methylation context. Cs in the CpG, CHH and CHG contexts should go into different files. This can be achieved with a perl or awk one-liner. If you want to analyze Cs from all contexts, then of course you don't need to separate the file. For example, to get all Cs in the CpG context you will need to do something like the following:

awk '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt
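
The CHG and CHH contexts can be pulled out in a similar way. Below is a minimal sketch, not a tested pipeline. It assumes the 5-base context column is always given on the plus strand and centered on the cytosine, so that a minus-strand C shows up as a G in the middle. The output file names and the small demo input are made up for illustration. The patterns use `.` and `..` instead of `.{1}` and `.{2}`; they match the same thing but avoid interval expressions, which some awk implementations do not support.

```shell
# Demo input in a scratch directory (invented rows, same layout as a BSMAP ratio file)
tmp=$(mktemp -d)
printf '%s\n' \
  'chr pos strand context ratio total_C methy_C CI_lower CI_upper' \
  'chr1 100 + ATCGG 0.500 10 5 0.200 0.800' \
  'chr1 101 - ACGAA 0.500 10 5 0.200 0.800' \
  'chr1 110 + ATCAG 0.100 10 1 0.010 0.400' \
  'chr1 120 - CTGAA 0.100 10 1 0.010 0.400' \
  'chr1 130 + ATCAT 0.000 10 0 0.000 0.300' \
  'chr1 140 - ATGAA 0.000 10 0 0.000 0.300' \
  | tr ' ' '\t' > "$tmp/BSMAPexample.txt"

# Plus strand:  the C is base 3 and the context is read left to right.
# Minus strand: the C appears as a G at base 3 and the context is read
#               right to left, complemented (H = A/C/T becomes A/G/T).
# The header line matches none of the rules, so it is dropped.
awk -v dir="$tmp" '
  ($3=="+" && $4~/^..CG/)          || ($3=="-" && $4~/^.CG/)         { print > (dir "/CpG.txt"); next }
  ($3=="+" && $4~/^..C[ACT]G/)     || ($3=="-" && $4~/^C[AGT]G/)     { print > (dir "/CHG.txt"); next }
  ($3=="+" && $4~/^..C[ACT][ACT]/) || ($3=="-" && $4~/^[AGT][AGT]G/) { print > (dir "/CHH.txt") }
' "$tmp/BSMAPexample.txt"
```

Rows whose context contains an N, or is otherwise incomplete, match none of the three rules and are simply skipped in this sketch.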

Here is what a methylation ratio file from BSMAP looks like (this file includes only Cs in the CpG context):

chr  pos     strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 +      CGCGT   0.000 56      0       0.000    0.064
chr1 3121597 +      ATCGG   0.000 56      0       0.000    0.064
chr1 3121599 +      GTCGT   0.000 56      0       0.000    0.064
chr1 3121605 +      CTCGG   0.000 56      0       0.000    0.064
chr1 3121606 +      TGCGC   0.000 56      0       0.000    0.064
chr1 3121607 +      GGCGC   0.000 56      0       0.000    0.064
chr1 3121611 +      CTCGA   0.000 56      0       0.000    0.064
chr1 3121614 +      TACGC   0.000 56      0       0.000    0.064
chr1 3121631 +      CTCGT   0.000 56      0       0.000    0.064

You can read the methylation ratio file by using the "pipeline" argument of the read() function. You need to provide a list of column numbers corresponding to chr, start, end, strand, coverage and methylation ratio. For the BSMAP file above, that means column 1 for chr, column 2 (pos) for both start and end, column 3 for strand, column 6 (total_C) for coverage and column 5 for the ratio. In fact, you can read any generic methylation ratio or percentage file using this method. The file needs to have the location information (chr, start, end and strand), coverage information, and methylation percentage or ratio information.

Thanks to Maxime Caron for sharing the BSMAP methylation ratio file.

Reference for methylKit:
Altuna Akalin, Matthias Kormaksson, Sheng Li, Francine E. Garrett-Bakelman, Maria E. Figueroa, Ari Melnick, Christopher E. Mason (2012). "methylKit: A comprehensive R package for the analysis of genome-wide DNA methylation profiles." Genome Biology, 13:R87.

PS: For the newer releases of BSMAP (~v2.73), you may need to coerce the CT counts to integers. The BSMAP ratio extractor may output effective CT counts as floats, but they must be integers by the time methylKit reads them. methylKit (>= v0.5.6) will automatically coerce those values to the nearest integer. However, you can also use awk to coerce float values to integers. The example below coerces the 6th column of the file to an integer:

awk '{OFS="\t";print $1,$2,$3,$4,$5,int($6);}' sample.BSMAP.txt
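
Note that awk's int() truncates toward zero rather than rounding, and that the one-liner above keeps only the first six columns (a header line, if present, gets a 0 in its sixth field). If you would rather round to the nearest integer and keep every column, something like the following sketch may work. It assumes a tab-separated file with a one-line header and non-negative counts in column 6. The demo rows and the output file name are invented for illustration.

```shell
# Demo input in a scratch directory (invented rows with float counts in column 6)
tmp=$(mktemp -d)
printf '%s\n' \
  'chr pos strand context ratio eff_CT_count C_count CT_count' \
  'chr1 100 + ATCGG 0.500 10.6 5 10' \
  'chr1 101 - ACGAA 0.250 4.2 1 4' \
  | tr ' ' '\t' > "$tmp/sample.BSMAP.txt"

awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { print; next }          # pass the header through untouched
     { $6 = int($6 + 0.5); print }    # round column 6 to the nearest integer
' "$tmp/sample.BSMAP.txt" > "$tmp/sample.BSMAP.int.txt"
```

Adding 0.5 before truncating only rounds correctly for non-negative values, which is fine for read counts.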

29 comments:

JLuis, November 5, 2012 at 5:53 AM
Dear Altuna,

Regarding this sentence:
"you need to separate BSMAP methylation ratio file based on methylation context. Cs in CpG, CHH and CHG context should be separated into different files. This can be achieved with a perl or awk one-liner."

Could you please write the Perl or awk one liner example, so that we can use it for the task you refer to?

With best wishes

altuna, November 5, 2012 at 6:49 AM
You can use something like this to get Cs in the CpG context
awk '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt

For the CHH and CHG contexts you will probably need to write a slightly more complex regular expression.

JLuis, November 7, 2012 at 1:20 AM
There's another problem I've come across in this workflow. Once I have the BSMAP 2.7 results in bsp format, I get the following error while running methratio.py:

Traceback (most recent call last):
  File "methratio.py", line 81, in <module>
    meth[cr] = array.array('I', [0]) * len(ref[cr])
MemoryError

To my understanding (I don't know any Python yet) it's due to a Python module but I have no clue which module is involved.
Has anybody else experienced this error using methratio.py?

Thanks in advance for your attention

Unknown, December 11, 2012 at 2:03 AM
I think it should be awk '($3=="-" && $4~/^.{2}GC/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt ( for minus strand it should be ($3=="-" && $4~/^.{2}GC/ ))

altuna, December 11, 2012 at 8:50 AM
I think not. The BSMAP readme.txt says that the 5bp sequence is always given on the plus strand, irrespective of the strand of the covered CpG. So a CpG on the minus strand will appear as ".{1}CG". For example, if strand="-", the string ACGAA denotes a CpG on the minus strand. Read on the minus strand, the same sequence appears as TTCGT.
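
One quick way to check this is to reverse-complement the context string in the shell. This is just a sketch: the revcomp helper name is made up, and it assumes the rev and tr utilities are available.

```shell
# Hypothetical helper: reverse-complement a DNA string (upper-case A/C/G/T only)
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

revcomp ACGAA   # minus-strand view of the plus-strand context ACGAA
```

For ACGAA this gives TTCGT, with the CG now sitting at bases 3 and 4, which is where the plus-strand pattern looks for it.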

Anonymous, December 28, 2012 at 2:14 PM
Hi, thanks for the info on reading in the BSMAP files. It was really helpful. I was wondering if you have any advice for generating a methylRawList from individual files that were read in this way.

The example given in the read{methylKit} help file only shows how to generate a methylRaw object from 'generic' files, and I can't figure out how to rework the example to generate a methylRawList from these individual objects.

Thanks

Claire Morgan, July 24, 2013 at 11:26 AM
Hello,

Firstly, thanks for posting details on how to use methylKit on BSMAP.

I have recently installed BSMAPv2.74 and the methratio.py program outputted a file with the following headers:

chr
pos
strand
context ratio
eff_CT_count
C_count
CT_count
rev_G_count
rev_GA_count
CI_lower
CI_upper

I can't find documentation that will help me relate these 11 headers to the 9 headers you have written about in your initial post. Some of them are of course obvious, but I'm quite confused trying to link 'total_C' and 'methy_C' with 'eff_CT_count', 'C_count' and 'CT_count'.

Do you know the answer, or could you post a link to where this might be described?

Kind regards

Unknown, August 31, 2016 at 6:23 PM
Why is the position of the CG important? And why is the position of the CG different on the - vs + strand? (^.{2}CG vs ^.{1}CG).

I'm trying to calculate Cs in CpG, CHH and CHG context and it would be nice to know what is going on.

Anonymous, June 6, 2017 at 1:09 PM
For those interested in what an awk statement would look like for all C contexts:

awk 'NR > 1 {
    if      (($3=="-" && $4~/^.CG../)         || ($3=="+" && $4~/^..CG./))         ctx = "CG"
    else if (($3=="-" && $4~/^C[AGT]G../)     || ($3=="+" && $4~/^..C[ACT]G/))     ctx = "CHG"
    else if (($3=="-" && $4~/^[AGT][AGT]G../) || ($3=="+" && $4~/^..C[ACT][ACT]/)) ctx = "CHH"
    else                                                                           ctx = "CNN"
    print $1"\t"$2-1"\t"$2"\t"$3"\t"ctx"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12
}' BSMAPratio/infile.txt > BSMAPratio/BSMAP_outfile.txt

Unknown, May 15, 2019 at 4:09 PM
If I have 14 BSMAP methylation ratio files, how do I read all of them in as a list, like you are supposed to in the methylKit process?

saima, March 25, 2021 at 6:29 AM
I have a bed file output from bsmap consisting of chr, start, end, methylation ratio, coverage but the strand is split by coverage (eg. 36 sites split into 31 on + strand and 5 on – strand). So how can I prepare the strand column for methylkit read function?

M.Kent, May 25, 2021 at 7:32 AM
The original BSMAP is old enough that it is a little difficult to install (because of a difference in char), and an improved version is posted on GitHub as BSMAPz.
https://github.com/zyndagj/BSMAPz

The methylation ratio file generated by BSMAPz's methratio.py differs in the context column. The context is already written as CHH, CpG or CHG (not like CGCGT).

So this awk command may work if you use BSMAPz and add the option "-i no-action":

awk '($4=="CG")' BSMAPzexample.txt > CpG.txt