Friday, June 19, 2015

Segmentation of methylation profiles using methylKit

methylKit is an R package for DNA methylation analysis and annotation using high-throughput bisulfite sequencing data. We recently included a new function in methylKit called methSeg(). The function can segment methylation profiles (methylRaw objects) or differential methylation profiles (methylDiff objects). Currently the function is available on a different branch for development. You can install the version of methylKit that has the segmentation functionality as shown below. We will move it to the master branch once we have more feedback for the function.


Regulatory regions can be discovered by analyzing methylation patterns. One of the popular methods is to segment methylomes into distinct categories such as regions with low methylation or high methylation. The most popular method for categorical methylome segmentation is based on hidden Markov models (HMM) (an example is here). Another segmentation strategy is to use variants of change-point analysis, where change points in a signal across the genome is extracted and the genome is segmented to regions between two change points These methods are typically used in CNV (copy-number variation) detection but have applications in methylation context as well.

In the context of methylation, we first extract segments between change points of the signal (methylation profiles) then cluster those segments into groups based on the mean methylation values of the segments. In this case, a segment is a region of the genome where CpGs have similar methylation values. We cluster those segments into segment groups using mixture modeling. The segment groups correspond to highly methylated regions or lowly methylated regions. Since this is an unsupervised method, we can find more patterns than just highly or lowly methylated regions. The users have to further overlap the segments with other marks and annotation such as promoters, H3K27ac or other histone modifications to associate segment groups with functional annotation.

Use case and example

Below, we are using methylation data from Roadmap Epigenomics H1 cell line. The code demonstrates how to use the function and export the results to a BED file. Similarly, we can apply this function on differential methylation profiles, where values range between -100 and 100. Do help(methSeg) to see the help page for the function.

Monday, April 13, 2015

Comparing the execution time between foverlaps and findOverlaps [data.table vs GenomicRanges]

Both of these functions find overlaps between genomic intervals. The findOverlaps function is from the Bioconductor package GenomicRanges (or IRanges if you don't need to compare intervals with an associated chromosome and strand). foverlaps is from the data.table package and is inspired by findOvelaps.

In genomics, we often have one large data set X with small interval ranges (usually sequenced reads) and another smaller data set Y with larger interval spans (usually exons, introns etc.). Generally, we are tasked with finding which intervals in X overlap with which intervals in Y.

In the foverlaps function Y has to be indexed using the setkey function (we don't have to do it on X). The key is intended to speed-up finding overlaps.

Which one is faster?

To check this we used the benchmark function from the rbenchmark package. It's a simple wrapper of the system.time function.

The code below plots the execution time of both functions for increasing numbers of rows of data set X.

Interestingly,  foverlaps is the fastest way to solve the problem of finding overlaps, but only when the large data set has less than 200k rows.

We also plotted situation when we exchanged the place of X and Y in arguments of both functions. In this case you can see that almost from the beginning foverlaps is much slower than findOverlaps.

Information about my R session:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.4 (Mavericks)

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] data.table_1.9.4     rbenchmark_1.0.0     GenomicRanges_1.18.4
[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        S4Vectors_0.4.0     
[7] BiocGenerics_0.12.1 

loaded via a namespace (and not attached):
[1] chron_2.3-45   plyr_1.8.1     Rcpp_0.11.5    reshape2_1.4.1 stringr_0.6.2 
[6] tools_3.1.3    XVector_0.6.0

Tuesday, March 31, 2015

Using genomation to analyze methylation profiles from Roadmap epigenomics and ENCODE

The genomation package is a toolkit for annotation and visualization of various genomic data. The package is currently also in BioC. It allows to analyze high-throughput data, including bisulfite sequencing data. Here, we will visualize the distribution of CpG methylation around promoters and their locations within gene structures on human chromosome 3.

Heatmap and plot of meta-profiles of CpG methylation around promoters

In this example we use data from Reduced Representation Bisulfite Sequencing (RRBS) and
Whole-genome Bisulfite Sequencing (WGBS) techniques and H1 and IMR90 cell types
derived from the ENCODE and the Roadmap Epigenomics Project databases.

We download the datasets and convert them to GRanges objects. Using rtracklayer and genomation functions. We also use a refseq bed file for annotation and extraction of promoter regions using readTranscriptFeatures function.

Since we have read the files now we can build base-pair resolution matrices of scores(methylation values) for each experiment. The returned list of matrices can be used to draw heatmaps or meta profiles of methylation ratio around promoters.

Distribution of covered CpGs across gene regions

genomation facilitates visualization of given locations of features aggregated by  exons, introns, promoters and TSSs. To find the distribution of covered CpGs within these gene structures, we will use transcript features we previously obtained. Here is the breakdown of the code

  1. Count overlap statistics between our CpGs from WGBS and RRBS H1 cell type and gene structures
  2. Calculate percentage of CpGs overlapping with annotation
  3. plot them in a form of pie charts

Tuesday, February 18, 2014

summary, annotation and visualization of genomic intervals with genomation package

We have been working on a new R package, genomation, that will help with the analysis of genomic intervals. Briefly, the package contains a collection of tools for visualizing and analyzing genome-wide data sets. The package works with a variety of genomic interval file types (BAM, bigWig, BED, GFF and any tabular format containing genomic locations) and enables easy summarization and annotation of high throughput data sets with given genomic annotations. Below is a presentation I'm preparing for an upcoming internal meeting. You can find more information and the vignette on the webpage.

genomation is published in Bioinformatics
Akalin A, Franke V, Vlahoviček K, Mason CE, Schübeler D. genomation: a toolkit to summarize, annotate and visualize genomic intervals. Bioinformatics. 2014 Nov 21. pii: btu775

It is also currently in the Bioconductor development version

Sunday, August 25, 2013

Using JavaScript visualization libraries with R

This is a short tutorial on knitr/markdown and JS visualization packages googleVis and rCharts. With these packages you can create web pages with interactive visualizations just using R. This will require minimal or no knowledge of HTML or JavaScript. 
You need to have the following R packages and their dependencies installed:
  • knitr
  • googleVis
  • rCharts (not on CRAN)
The tutorial is organized in four parts. First two parts introduce basics about knitr and markdown. The last parts are about using googleVis and rCharts packages in markdown documents. However, the tutorial does not have in-depth examples that show all the capabilities of rCharts and googleVis.

You can download the .Rmd files (or clone the repository from github) and run knit2html() on them in your R console, or if you are using RStudio you can click "knit HTML" button on the upper left corner.

The best way to go through the tutorial is to examine the code chunks and explanations in .Rmd files, and then check the HTML output from knit2html().

1. R and markdown

markdown_knitr.Rmd shows basics of markdown and knitr integration. These tools will help you create an HTML document using R. knit2html()output is here. In addition, R markdown basics are described here.

2. Customizing code output in R markdown documents

controlling_knitr.Rmd shows how code chunk output can be controlled by knitr options. knit2html() output is here

3. Using Google visualization API in R markdown documents

googleVis.Rmd shows how to use googleVis package in a markdown document.You can incorporate plots from Google Visualization API in your R markdown document, which will be converted to an HTML document by knit2html(), the HTML output is here.

4. Using multiple JS visualization libraries in R markdown documents

rCharts.Rmd shows how to use rCharts package in a markdown document. Using rCharts, you can incorporate various JS visualizations (such as Polychart, NVD3 etc.) on your HTML document. knit2html() output is here.

The project page is:

Wednesday, July 24, 2013

Q&A forum for methylKit

methylKit is an R package for DNA methylation analysis and annotation using high-throughput bisulfite sequencing data. We recently activated a Q&A forum for methylKit related questions. If you are using methylKit and if you have questions about it,  it is a great place to ask questions and browse previously answered questions.

You can reach it via!forum/methylkit_discussion

Tuesday, March 19, 2013

EpiWorkshop 2013: DNA methylation analysis in R

Elemento Lab at Weill Cornell Medical College organized a workshop on Epigenomics. I had the opportunity to give a tutorial on DNA methylation analysis in R. The tutorial demonstrates how to analyze high-throughput bisulfite sequencing data ( RRBS is used as an example in the tutorial) with R package methylKit.

My slides are below and all of the workshop material is here at the workshop website.