Recipes, scripts and genomics: methylation

Showing posts with label methylation. Show all posts

Friday, June 19, 2015

Segmentation of methylation profiles using methylKit

methylKit is an R package for DNA methylation analysis and annotation using high-throughput bisulfite sequencing data. We recently included a new function in methylKit called methSeg(). The function can segment methylation profiles (methylRaw objects) or differential methylation profiles (methylDiff objects). Currently the function is available on a different branch for development. You can install the version of methylKit that has the segmentation functionality as shown below. We will move it to the master branch once we have more feedback for the function.

Rationale

Regulatory regions can be discovered by analyzing methylation patterns. One of the popular methods is to segment methylomes into distinct categories such as regions with low methylation or high methylation. The most popular method for categorical methylome segmentation is based on hidden Markov models (HMM) (an example is here). Another segmentation strategy is to use variants of change-point analysis, where change points in a signal across the genome is extracted and the genome is segmented to regions between two change points These methods are typically used in CNV (copy-number variation) detection but have applications in methylation context as well.

In the context of methylation, we first extract segments between change points of the signal (methylation profiles) then cluster those segments into groups based on the mean methylation values of the segments. In this case, a segment is a region of the genome where CpGs have similar methylation values. We cluster those segments into segment groups using mixture modeling. The segment groups correspond to highly methylated regions or lowly methylated regions. Since this is an unsupervised method, we can find more patterns than just highly or lowly methylated regions. The users have to further overlap the segments with other marks and annotation such as promoters, H3K27ac or other histone modifications to associate segment groups with functional annotation.

Use case and example

Below, we are using methylation data from Roadmap Epigenomics H1 cell line. The code demonstrates how to use the function and export the results to a BED file. Similarly, we can apply this function on differential methylation profiles, where values range between -100 and 100. Do help(methSeg) to see the help page for the function.

Wednesday, July 24, 2013

Q&A forum for methylKit

methylKit is an R package for DNA methylation analysis and annotation using high-throughput bisulfite sequencing data. We recently activated a Q&A forum for methylKit related questions. If you are using methylKit and if you have questions about it, it is a great place to ask questions and browse previously answered questions.

You can reach it via https://groups.google.com/forum/#!forum/methylkit_discussion

Tuesday, March 19, 2013

EpiWorkshop 2013: DNA methylation analysis in R

Elemento Lab at Weill Cornell Medical College organized a workshop on Epigenomics. I had the opportunity to give a tutorial on DNA methylation analysis in R. The tutorial demonstrates how to analyze high-throughput bisulfite sequencing data ( RRBS is used as an example in the tutorial) with R package methylKit.

My slides are below and all of the workshop material is here at the workshop website.

DNA methylation analysis in R from Altuna Akalin

Friday, March 1, 2013

Using ENCODE methylation data (RRBS) in R

ENCODE project has generated reduced-representation bilsulfite sequencing data for multiple cell lines. The data is organized in an extended bed format with additional columns denoting % methylation and coverage per base. Luckily, this sort of generic % methylation information can be read in by R package methylKit, which is a package for analyzing basepair resolution 5mC and 5hmC data.

The code snippets below show how to read RRBS bed file produced by ENCODE. But, let's first download the files.

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/wgEncodeHaibMethylRrbsA549Dm002p7dHaibSitesRep1.bed.gz

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/wgEncodeHaibMethylRrbsAg04449UwstamgrowprotSitesRep1.bed9.gz

Unfortunately, methylKit currently can not read them directly because the track definition line causes a problem. It should be deleted from each bed file. Ideally, methylKit should be able to skip over lines (this issue should be fixed in later versions)
For now, we have to use some unix tools to remove the first lines from the bed files. You run the code below in your terminal. This set of commands will delete the first line from every *.gz file in the directory so be careful.

for files in *.gz
do
  gzip -dc "$files" | tail +2 | gzip -c > "$files".tmp
  if [ "$?" -eq 0 ]; then
    mv "$files".tmp "$files"
  fi
done

Now we can read the files using methylKit. The pipeline argument defines which columns in the file are corresponding to chr,start,end,strand, percent methylation and coverage:
You can also read multiple files at a time:

Since we have read the files and now they are methylKit objects, we can use all of the methylKit functionality on these objects. For example, the code below plots the distribution of methylation % for covered bases.


getMethylationStats(obj[[1]], plot = TRUE)

You can check the methylKit vignette and the website for the rest of the functionality and details.

Friday, October 5, 2012

How to read BSMAP methylation ratio files into R via methylKit

BSMAP is an aligner for bisulfite sequencing reads. It outputs aligned reads as well as methylation ratios per base (via methratio.py script). The methylation ratios can be read into R via methylKit package and regular methylKit analysis can be performed using the BSMAP data.

First, you need to separate BSMAP methylation ratio file based on methylation context. Cs in CpG, CHH and CHG context should be separated into different files. This can be achieved with a perl or awk one-liner. If you want to analyze Cs from all contexts then of course you don't need to separate the file. For example, to get all Cs in the CpG context you will need to do something like following:

awk '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt

Here is how methylation ratio file from BSMAP looks like (file includes only Cs in CpG context):

chr  pos     strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 +      CGCGT   0.000 56      0       0.000    0.064
chr1 3121597 +      ATCGG   0.000 56      0       0.000    0.064
chr1 3121599 +      GTCGT   0.000 56      0       0.000    0.064
chr1 3121605 +      CTCGG   0.000 56      0       0.000    0.064
chr1 3121606 +      TGCGC   0.000 56      0       0.000    0.064
chr1 3121607 +      GGCGC   0.000 56      0       0.000    0.064
chr1 3121611 +      CTCGA   0.000 56      0       0.000    0.064
chr1 3121614 +      TACGC   0.000 56      0       0.000    0.064
chr1 3121631 +      CTCGT   0.000 56      0       0.000    0.064

You can read the methylation ratio file by using "pipeline" argument in read() function. You need to provide a list of column numbers corresponding to chr,start,end,strand,coverage and ratio of methylation. Actually, you can read any generic methylation ratio or percentage file using this method. The file needs to have the location information (chr,start,end and strand), coverage information and methylation percentage or ratio information.

Thanks to Maxime Caron for sharing the BSMAP methylation ratio file.

Reference for methylKit:
Altuna Akalin, Matthias Kormaksson, Sheng Li, Francine E. Garrett-Bakelman, Maria E. Figueroa, Ari Melnick, Christopher E. Mason.(2012). "methylKit: A comprehensive R package for the analysis of genome-wide DNA methylation profiles." Genome Biology , 13:R87.

PS: For the newer releases of BSMAP (~v2.73), you may need to coerce the CT counts to an integer. BSMAP ratio extractor may output effective CT counts as float but they should be coerced to integer before they are read in by methylKit. methylKit (>= v0.5.6) will automatically coerce those values to nearest integer. However, you can also use awk to coerce float to integer values. The example below coerces the 6th column of the the file to integer

awk '{OFS="\t";print $1,$2,$3,$4,$5,int($6);}' sample.BSMAP.txt

Thursday, June 21, 2012

Plotting differentially methylated bases on an ideogram

Ideogram is a popular way to show genomic features in relation to chromosome sizes and chromosomal location in bulk. Recently, we have shown differentially methylated CpGs on a chromosomal ideogram in our PLoS Genetics paper. Although it doesn't always make sense to show differentially methylated bases on an ideogram, I'm describing a function below that does a similar job.
The function can be used to plot ideograms for differentially methylated CpGs. The function requires methylKit, GenomicRanges and ggbio packages. GenomicRanges and ggbio are available from bioconductor

First, we need to get chromosome lengths that are needed to define chromosome boundaries. I used BSgenome.Hsapiens.UCSC.hg18 package from Bioconductor to get the chromosome lengths of hg18 assembly.

library(BSgenome)
library("BSgenome.Hsapiens.UCSC.hg18")
chr.len = seqlengths(Hsapiens)  # get chromosome lengths
# remove X,Y,M and random chromosomes
chr.len = chr.len[grep("_|M|X|Y", names(chr.len), invert = T)]

Now, we need to download a methlyDiff object to be plotted. This can be any methylDiff object from methylKit.


download.file("http://methylkit.googlecode.com/files/myDiff.rda", 
    destfile = "myDiff.rda")
load("myDiff.rda")

Next, we plot the ideogram using the chromosome length and methylDiff object. “difference” is the percent methylation difference threshold and “qvalue” is the qvalue threshold, both used to define differentially methylated bases in the function. The differentially methylated bases satisfying “difference” and “qvalue” thresholds are plotted in a color coded manner in the ideogram.

ideoDMC(myDiff, chrom.length = chr.len, difference = 25, qvalue = 0.01, 
    circos = FALSE, title = "test", hyper.col = "magenta", hypo.col = "green")

We can also plot a circular ideogram similar to circos plots using the same function.

ideoDMC(myDiff, chrom.length = chr.len, difference = 25, qvalue = 0.01, 
    circos = TRUE, title = "test", hyper.col = "magenta", hypo.col = "green")

Tuesday, December 20, 2011

methylKit: R package for DNA methylation analysis

High-throughput bisulfite sequencing based methods are popular for measuring genome-wide DNA methylation levels. Here is an R package that helps with the analysis of such DNA methylation data. Although, it is still under heavy development current functionality allows users to do many of the essential data analysis tasks. The package is primarily designed for Reduced Representation Bisulfite Sequencing (RRBS), however it can also handle whole-genome bisulfite sequencing if proper input format is provided.

Functionality includes:

Read coverage statistics
Methylation statistics
Sample correlation and clustering
Differential methylation analysis
Feature annotation and accessor/coercion functions
Multiple visualization options
Regional and tiling windows analysis
(Almost) proper documentation

The package is here:
http://code.google.com/p/methylkit/

Pages