Tuesday, March 19, 2013

EpiWorkshop 2013: DNA methylation analysis in R

Elemento Lab at Weill Cornell Medical College organized a workshop on Epigenomics. I had the opportunity to give a tutorial on DNA methylation analysis in R. The tutorial demonstrates how to analyze high-throughput bisulfite sequencing data ( RRBS is used as an example in the tutorial) with R package methylKit.

My slides are below and all of the workshop material is here at the workshop website.



Friday, March 1, 2013

Using ENCODE methylation data (RRBS) in R

ENCODE project has generated reduced-representation bilsulfite sequencing data for multiple cell lines. The data is organized in an extended bed format with additional columns denoting % methylation and coverage per base. Luckily, this sort of generic % methylation information can be read in by R package methylKit, which is a package for analyzing basepair resolution 5mC and 5hmC data.

The code snippets below show how to read RRBS bed file produced by ENCODE. But, let's first download the files.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/wgEncodeHaibMethylRrbsA549Dm002p7dHaibSitesRep1.bed.gz

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/wgEncodeHaibMethylRrbsAg04449UwstamgrowprotSitesRep1.bed9.gz

Unfortunately, methylKit currently can not read them directly because the track definition line causes a problem. It should be deleted from each bed file. Ideally, methylKit should be able to skip over lines (this issue should be fixed in later versions)
For now, we have to use some unix tools to remove the first lines from the bed files. You run the code below in your terminal. This set of commands will delete the first line from every *.gz file in the directory so be careful.
for files in *.gz
do
  gzip -dc "$files" | tail +2 | gzip -c > "$files".tmp
  if [ "$?" -eq 0 ]; then
    mv "$files".tmp "$files"
  fi
done

Now we can read the files using methylKit. The pipeline argument defines which columns in the file are corresponding to chr,start,end,strand, percent methylation and coverage:
You can also read multiple files at a time:

Since we have read the files and now they are methylKit objects, we can use all of the methylKit functionality on these objects. For example, the code below plots the distribution of methylation % for covered bases.

getMethylationStats(obj[[1]], plot = TRUE)


You can check the methylKit vignette and the website for the rest of the functionality and details.




Wednesday, February 20, 2013

Progress bar in R


A decent percentage of working time in R, I spend looping over chromosomes, transcription factors or tissues, usually, using parallelization.

To get the stuff to run simultaneously I use the foreach function from the doMC package, and for monitoring of the progress of the execution, I made use of the cat/print functions, which mostly just clutter the terminal.

Today a friend of mine set me a link to this blog, and I was dumbfounded when I read that the R base package has an inbuilt function for a nice looking progress bar. To my luck, it works perfectly when put inside a foreach loop.

library(doMC)
registerDoMC(5)
total <- 200
# create progress bar
pb <- txtProgressBar(min = 0, max = total, style = 3)
foreach(i = 1:total)%dopar%{
   Sys.sleep(0.1)
   setTxtProgressBar(pb, i)
}

close(pb)

Friday, December 14, 2012

Sending commands from Notepad++ to a remote R session

If you have your working environment set up in a Windows operating system, it can be a bit of a hassle to work with R sessions on remote Linux servers.

I use WinSCP + Notepad++ to handle my projects and Putty + screen to handle the R sessions. It becomes tiring to use the mouse to move the code from Notepad to the Putty all the time. Thankfully the amazing AutoHotkey comes to the rescue.

Using AutoHotkey it is possible to set up a hotkey macro which will copy the code that you want from the Notepad and paste it into Putty.

The following code does exactly that:

!q::
Send {End 2}+{Home}^c{End 2}{Down}{End 2}
WinActivate, root
Send +{Ins}{Enter}
SetTitleMatchMode, 2
WinActivate, Notepad
return


Let's break it apart.

!q:: - defines the hotkey as alt-q

Send {End 2}+{Home}^c{End 2}{Down}{End}
 - selects the current line of code and copies the stuff into clipboard. It does that by brute force manipulation of the pointer. Firstly it moves the pointer to the end of the current line, does a shift + home, which selects the text. ^c copies the text to the clipboard, and the last two commands put the pointer to the end of the next line of code. The reason for doing the End command twice is that the macro becomes confused if the word wrap is on.


WinActivate, root - the winactivate command selects any window by it's name. When I log onto my remote server the Putty window is named root, so basically this line just switches to the terminal.

Send +{Ins}{Enter} - pastes the line of code from the clipboard and executes the line

SetTitleMatchMode, 2 - this command regulates the pattern matching for the WinActivate command. 2 means that the pattern can be found anywhere in the name of the window. The default is 1 which means that pattern must be found at the beginning of the name of the window.

WinActivate, Notepad - switches us back to the Notepad 

return - ends the macro


For the code to work, you need to install the AutoHotkey, create a file that has the .ahk extension, copy the code into the file, and run it.

I hope this will save you some time!   


 

Tuesday, December 11, 2012

Sir John Gurdon Nobel lecture

Sir John Gurdon gave a phenomenal Nobel lecture on the current advancements in the nuclear transfer methodology.

Gurdon gives an overview of the main players, and their roles, in what he describes as the constant battle between the nucleus and the egg, and leaves us with a bit of a cliffhanger towards the end.

If you have 40 minutes to spare, it will be well worth your time.


Friday, October 5, 2012

How to read BSMAP methylation ratio files into R via methylKit

BSMAP is an aligner for bisulfite sequencing reads. It outputs aligned reads as well as methylation ratios per base (via methratio.py script). The methylation ratios can be read into R via methylKit package and regular methylKit analysis can be performed using the BSMAP data.

First,  you need to separate BSMAP methylation ratio file based on methylation context. Cs in CpG, CHH and CHG context should be separated into different files. This can be achieved with a perl or awk one-liner. If you want to analyze Cs from all contexts then of course you don't need to separate the file. For example, to get all Cs in the CpG context you will need to do something like following:

awk '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" &&  $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt

Here is how methylation ratio file from BSMAP looks like (file includes only Cs in CpG context):

chr  pos     strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 +      CGCGT   0.000 56      0       0.000    0.064
chr1 3121597 +      ATCGG   0.000 56      0       0.000    0.064
chr1 3121599 +      GTCGT   0.000 56      0       0.000    0.064
chr1 3121605 +      CTCGG   0.000 56      0       0.000    0.064
chr1 3121606 +      TGCGC   0.000 56      0       0.000    0.064
chr1 3121607 +      GGCGC   0.000 56      0       0.000    0.064
chr1 3121611 +      CTCGA   0.000 56      0       0.000    0.064
chr1 3121614 +      TACGC   0.000 56      0       0.000    0.064
chr1 3121631 +      CTCGT   0.000 56      0       0.000    0.064



You can read the methylation ratio file by using "pipeline" argument in read() function. You need to provide a list of column numbers corresponding to chr,start,end,strand,coverage and ratio of methylation. Actually, you can read any generic methylation ratio or percentage file using this method. The file needs to have the location information (chr,start,end and strand), coverage information and methylation percentage or ratio information.



Thanks to Maxime Caron for sharing the BSMAP methylation ratio file.


Reference for methylKit:
Altuna Akalin, Matthias Kormaksson, Sheng Li, Francine E. Garrett-Bakelman, Maria E. Figueroa, Ari Melnick, Christopher E. Mason.(2012)"methylKit: A comprehensive R package for the analysis of genome-wide DNA methylation profiles." Genome Biology , 13:R87.



PS: For the newer releases of BSMAP (~v2.73), you may need to coerce the CT counts to an integer. BSMAP ratio extractor may output effective CT counts as float but they should be coerced to integer before they are read in by methylKit. methylKit (>= v0.5.6) will automatically coerce those values to nearest integer. However, you can also use awk to coerce float to integer values. The example below coerces the 6th column of the the file to integer

awk '{OFS="\t";print $1,$2,$3,$4,$5,int($6);}' sample.BSMAP.txt



Thursday, September 27, 2012

Downloading Annotation Files for methylKit

DNA methylation analysis package methylKit can annotate differentially methylated bases/regions or bases/regions covered by reads.  The package supports input of annotation files in BED format. The annotation file can contain the location of the genes, CpG islands or any other genomic feature of interest. The package will read in the provided annotation file and produce a GRanges object (from GenomicRanges package) to be used in subsequent functions for annotating regions or bases of interest.

You can download annotation files from UCSC table browser for your genome of interest. Go to this webpage. On the top menu click on "tools" then "table browser". Select your "genome" of interest and "assembly" of interest from the drop down menus. Make sure you select the correct genome and assembly. Selecting wrong genome and/or assembly will return unintelligible results in downstream analysis.
From here on you can either download gene annotation or CpG island annotation.
  1. For gene annotation, select "Genes and Gene prediction tracks" from the "group" drop-down menu. Following that, select "Refseq Genes" from the "track" drop-down menu. Select "BED- browser extensible data" for the "output format". Click "get output" and on the following page click "get BED" without changing any options. save the output as a text file.
  2. For CpG island annotation, select "Regulation" from the "group" drop-down menu. Following that, select "CpG islands" from the "track" drop-down menu. Select "BED- browser extensible data" for the "output format". Click "get output" and on the following page click "get BED" without changing any options. save the output as a text file.
In addition, you can check this tutorial to learn how to download any track from UCSC in BED format http://www.openhelix.com/cgi/tutorialInfo.cgi?id=28