Thursday, October 15, 2015

New features in genomation package

Extending genomation to work with paired-end BAM files

Genomation is an R package to summarize, annotate and visualize genomic intervals. It contains a collection of tools for visualizing and analyzing genome-wide data sets, i.e. RNA-seq, bisulfite sequencing or chromatin-immunoprecipitation followed by sequencing (ChIP-seq) data.

Recently we added new features to genomation and here we present them on example of binding profiles of 6 transcription factors around the CTCF binding sites derived from ChIP-seq. All new functionalities are available in the latest version of genomation that can be found on its github website.

# install the package from github

Extending genomation to work with paired-end BAM files

Genomation can work with paired-end BAM files. Mates from reads are treated as fragments (they are stitched together).

genomationDataPath = system.file('extdata',package='genomationData')
bam.files = list.files(genomationDataPath, full.names=TRUE, pattern='bam$')
bam.files = bam.files[!grepl('Cage', bam.files)]

Accelerate functions responsible for reading genomic files

This is achived by using readr::read_delim function to read genomic files instead of read.table. Additionally if skip=“auto” argument is provided in readGeneric_or track.line=“auto” in other functions that read genomic files, e.g. _readBroadPeak then UCSC header is detected (and first track).


ctcf.peaks = readBroadPeak(file.path(genomationDataPath, 
ctcf.peaks = ctcf.peaks[seqnames(ctcf.peaks) == 'chr21']
ctcf.peaks = ctcf.peaks[order(-ctcf.peaks$signalValue)]
ctcf.peaks = resize(ctcf.peaks, width=1000, fix='center')

Parallelizing data processing in ScoreMatrixList

We use ScoreMatrixList function to extract coverage values of all transcription factors around ChIP-seq peaks. ScoreMatrixList was improved by adding new argument cores that indicates number of cores to be used at the same time (by using parallel:mclapply).

sml = ScoreMatrixList(bam.files, ctcf.peaks, bin.num=50, type='bam', cores=2)

# descriptions of file that contain info. about transcription factors
sampleInfo = read.table(system.file('extdata/SamplesInfo.txt',
                                    package='genomationData'),header=TRUE, sep='\t')
names(sml) = sampleInfo$sampleName[match(names(sml),sampleInfo$fileName)]

Arithmetic, indicator and logic operations as well as subsetting work on score matrices

Arithmetic, indicator and logic operations work on ScoreMatrix, ScoreMatrixBin and ScoreMatrixList
objects, e.i.:
Arith: “+”, “-”, “*”, “”, “%%”, “%/%”, “/”
Compare: “==”, “>”, “<”, “!=”, “<=”, “>=”
Logic: “&”, “|”

sml1 = sml * 100
## scoreMatrixlist of length:5
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50


sml[[6]] = sml[[1]]
## scoreMatrixlist of length:6
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50
## 6. scoreMatrix with dims: 1681 50
sml[[6]] <- NULL

Improvements and new arguments in visualization functions

Due to large signal scale of rows of each element in the ScoreMatrixList we scale them.

sml.scaled = scaleScoreMatrixList(sml)

Faster heatmaps

HeatMatrix and multiHeatMatrix function works faster by faster assigning colors. Heatmap profile of scaled coverage shows a colocalization of Ctcf, Rad21 and Znf143.

multiHeatMatrix(sml.scaled, xcoords=c(-500, 500))

plot of chunk unnamed-chunk-8

New clustering possibilities in heatmaps: “clustfun” argument in multiHeatMatrix

clustfun allow to add more clustering functions and integrate them with the heatmap function multiHeatMatrix. It has to be a function that returns a vector of integers indicating the cluster to which each point is allocated. Previous version of multiHeatMatrix could cluster rows of heatmaps using only k-means algorithm.

# k-means algorithm with 2 clusters
cl1 <- function(x) kmeans(x, centers=2)$cluster
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1)

plot of chunk unnamed-chunk-9

# hierarchical clustering with Ward's method for agglomeration into 2 clusters
cl2 <- function(x) cutree(hclust(dist(x), method="ward"), k=2)
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl2)
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"

plot of chunk unnamed-chunk-9

Defining which matrices are used for clustering: “clust.matrix” in multiHeatMatrix

clust.matrix argument indicates which matrices are used for clustering. It can be a numerical vector of indexes of matrices or a character vector of names of the ‘ScoreMatrix’ objects in 'ScoreMatrixList'. Matrices that are not in clust.matrix are ordered according to the result of the clustering algorithm. By default all matrices are clustered.

multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1, clust.matrix = 1)

plot of chunk unnamed-chunk-10

Central tendencies in line plots: centralTend in plotMeta

We extended visualization capabilities for meta-plots. plotMeta function can plot not only mean, but also median as central tendency and it can be set up using centralTend argument. Previously user could plot only mean.

plotMeta(mat=sml.scaled, profile.names=names(sml.scaled),
         xcoords=c(-500, 500),

plot of chunk unnamed-chunk-11

Smoothing central tendency: smoothfun in plotMeta

We added smoothfun argument to smooth central tendency as well as dispersion bands around it which is shown in the next figure. Smoothfun has to be a function that returns a list that contains a vector of y coordinates (vector named '$y').

plotMeta(mat=sml.scaled, profile.names=names(sml.scaled),
         xcoords=c(-500, 500),
         smoothfun=function(x) stats::smooth.spline(x, spar=0.5))

plot of chunk unnamed-chunk-12

Plotting dispersion around central lines in line plots: dispersion in plotMeta

We added new argument dispersion to plotMeta that shows dispersion bands around centralTend. It can take one of the arguments:

  • “se” shows standard error of the mean and 95 percent confidence interval for the mean
  • “sd” shows standard deviation and 2*(standard deviation)
  • “IQR” shows 1st and 3rd quartile and confidence interval around the median based on the median +/- 1.57 * IQR/sqrt(n) (notches)
plotMeta(mat=sml, profile.names=names(sml),
         xcoords=c(-500, 500),
         smoothfun=function(x) stats::smooth.spline(x, spar=0.5),
         dispersion="se", lwd=4)

plot of chunk unnamed-chunk-13

Calculating scores that correspond to k-mer or PWM matrix occurence: patternMatrix function

We added new function patternMatrix that calculates k-mer and PWM occurrences over predefined equal width windows. If one pattern (character of length 1 or PWM matrix) is given then it returns ScoreMatrix, if more than one character ot list of PWM matrices then ScoreMatrixList. It finds either positions of pattern hits above a specified threshold and creates score matrix filled with 1 (presence of pattern) and 0 (its absence) or matrix with score themselves. windows can be a DNAStringList object or GRanges object (but then genome argument has to be provided, a BSgenome object).

#ctcf motif from the JASPAR database
ctcf.pfm = matrix(as.integer(c(87,167,281,56,8,744,40,107,851,5,333,54,12,56,104,372,82,117,402, 
rownames(ctcf.pfm) <- c("A","C","G","T")

prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25)
priorProbs = prior.params/sum(prior.params)
postProbs = t( t(ctcf.pfm + prior.params)/(colSums(ctcf.pfm)+sum(prior.params)) )
ctcf.pwm = Biostrings::unitScale(log2(postProbs/priorProbs))

hg19 = BSgenome.Hsapiens.UCSC.hg19

p = patternMatrix(pattern=ctcf.pwm, windows=ctcf.peaks, genome=hg19, min.score=0.8)

Visualization of the patternMatrix patternMatrix (here as ScoreMatrix object) can be visualized using i.e. heatMatrix, heatMeta or plotMeta functions.

heatMatrix(p, xcoords=c(-500, 500), main="CTCF motif")

plot of chunk unnamed-chunk-15

plotMeta(mat=p, xcoords=c(-500, 500), smoothfun=function(x) stats::lowess(x, f = 1/10), 
         line.col="red", main="ctcf motif")

plot of chunk unnamed-chunk-15

Integration with Travis CI for auto-testing

Recently we integrated genomation with Travis CI. It allows users to see current status of the package which is updated during every change of the package. Travis automatically runs R CMD CHECK and reports it. Shields shown below are on the genomation github site:
Status Build Status BioC_years BioC_availability

# <br />
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.36.3                  
##  [3] rtracklayer_1.28.10               Biostrings_2.36.4                
##  [5] XVector_0.8.0                     GenomicRanges_1.20.8             
##  [7] GenomeInfoDb_1.4.3                IRanges_2.2.9                    
##  [9] S4Vectors_0.6.6                   BiocGenerics_0.14.0              
## [11] genomation_1.1.27                 BiocInstaller_1.18.5             
## [13] devtools_1.9.1                   
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1             formatR_1.2.1          
##  [3] futile.logger_1.4.1     plyr_1.8.3             
##  [5] bitops_1.0-6            futile.options_1.0.0   
##  [7] tools_3.2.2             zlibbioc_1.14.0        
##  [9] digest_0.6.8            gridBase_0.4-7         
## [11] evaluate_0.8            memoise_0.2.1          
## [13] gtable_0.1.2            curl_0.9.3             
## [15] yaml_2.1.13             proto_0.3-10           
## [17] httr_1.0.0              stringr_1.0.0          
## [19] knitr_1.11              data.table_1.9.6       
## [21] impute_1.42.0           R6_2.1.1               
## [23] plotrix_3.5-12          XML_3.98-1.3           
## [25] BiocParallel_1.2.22     seqPattern_1.0.1       
## [27] rmarkdown_0.8.1         readr_0.1.1            
## [29] reshape2_1.4.1          ggplot2_1.0.1          
## [31] lambda.r_1.1.7          magrittr_1.5           
## [33] matrixStats_0.14.2      MASS_7.3-44            
## [35] scales_0.3.0            Rsamtools_1.20.5       
## [37] htmltools_0.2.6         GenomicAlignments_1.4.2
## [39] colorspace_1.2-6        KernSmooth_2.23-15     
## [41] stringi_0.5-5           munsell_0.4.2          
## [43] RCurl_1.95-4.7          chron_2.3-47           
## [45] markdown_0.7.7

Monday, July 6, 2015

GNU Guix for easily managing bioinformatics software ( or any other software )

We are using GNU Guix to manage bioinformatics software that are installed on our HPC and also on the individual machines. By now, there are 54 bionformatics related packages in Guix repository, and the number is growing. The list includes many NGS related software such as samtools, bowtie, STAR, sra-tools etc. The full list is here:

Why use Guix ?
GNU Guix is a package managing tool, that can build from source with dependencies, set up user environments and can upgrade/downgrade installed packages.

- Easy to install and maintain packaged software. One line on the terminal can update or install a package.
- Guix creates an isolated yet shared environment. Each piece of software is installed in /gnu/store and does not alter global state and the user profiles link to this directory.
- User profiles are independent from each other. You can even create a specific user profile per project and this helps with reproducibility.
- Guix can be used in an HPC environment or individual machines connected to the same network (described here)

How to install software using Guix
If you have GNU Guix set up, installing pre-packaged software is super easy. And we have most of the popular tools already packaged. For example, below is how to install samtools.

guix package -i samtools

Setting up GNU Guix is also easy. Just follow the instructions provided by Ricardo Wurmus (who packaged most of the bioinformatics software for Guix by the way)

How to contribute Guix and/or Guix bioinformatics packages.
The suggested way to contribute to upstream is to use a git checkout, make changes there, create a patch with git format-patch and send that to (
There's also a contribution page:
Luckily, getting started with development is relatively simple:
  • git clone git://
  • if you already have guix do guix environment guix to load up everything you need to build guix from source
  • ./bootstrap./configure (point --localstatedir to /var to make it use the same state as your binary installation of guix), make
  • edit package modules in gnu/packages/
  • run ./pre-inst-env guix lint to find common problems (pre-inst-env makes guix use the packages in the local directory)
  • build the new package with ./pre-inst-env guix build my-package
  • if everything works fine: commit and create a patch with git format-patch -1
  • send the file to the mailing list for review

Sunday, July 5, 2015

Hands-on computational genomics course in Berlin

Berlin Institute for Medical Systems Biology is organizing a computational genomics course and R programming will be used for most practical sessions. The course will cover basic statistics, programming and basic concepts in next-generation sequencing as well as it is applications such as RNA-seq and ChIP-seq. There will be practical sessions every day. The course will last 7 days and will be taught at MDC-Berlin campus between 18-24 October 2015.

Course Modules

The course modules are as follows. Almost all modules have practical sessions and participants will have a chance to directly apply what they learned.
- Introduction to R and Bioconductor
- Statistics and Exploratory Data analysis
- Introduction to Next-gen sequencing
- Applications of computational genomics
- RNA-seq analysis
- ChIP-seq analysis
- Data integration and visualization

The instructors

The instructors include high-profile local and external scientists working on computational genomics.
See the full list at:

Apply by 30 July 2015

see instructions at:

Event poster

Friday, June 19, 2015

Segmentation of methylation profiles using methylKit

methylKit is an R package for DNA methylation analysis and annotation using high-throughput bisulfite sequencing data. We recently included a new function in methylKit called methSeg(). The function can segment methylation profiles (methylRaw objects) or differential methylation profiles (methylDiff objects). Currently the function is available on a different branch for development. You can install the version of methylKit that has the segmentation functionality as shown below. We will move it to the master branch once we have more feedback for the function.


Regulatory regions can be discovered by analyzing methylation patterns. One of the popular methods is to segment methylomes into distinct categories such as regions with low methylation or high methylation. The most popular method for categorical methylome segmentation is based on hidden Markov models (HMM) (an example is here). Another segmentation strategy is to use variants of change-point analysis, where change points in a signal across the genome is extracted and the genome is segmented to regions between two change points These methods are typically used in CNV (copy-number variation) detection but have applications in methylation context as well.

In the context of methylation, we first extract segments between change points of the signal (methylation profiles) then cluster those segments into groups based on the mean methylation values of the segments. In this case, a segment is a region of the genome where CpGs have similar methylation values. We cluster those segments into segment groups using mixture modeling. The segment groups correspond to highly methylated regions or lowly methylated regions. Since this is an unsupervised method, we can find more patterns than just highly or lowly methylated regions. The users have to further overlap the segments with other marks and annotation such as promoters, H3K27ac or other histone modifications to associate segment groups with functional annotation.

Use case and example

Below, we are using methylation data from Roadmap Epigenomics H1 cell line. The code demonstrates how to use the function and export the results to a BED file. Similarly, we can apply this function on differential methylation profiles, where values range between -100 and 100. Do help(methSeg) to see the help page for the function.

Monday, April 13, 2015

Comparing the execution time between foverlaps and findOverlaps [data.table vs GenomicRanges]

Both of these functions find overlaps between genomic intervals. The findOverlaps function is from the Bioconductor package GenomicRanges (or IRanges if you don't need to compare intervals with an associated chromosome and strand). foverlaps is from the data.table package and is inspired by findOvelaps.

In genomics, we often have one large data set X with small interval ranges (usually sequenced reads) and another smaller data set Y with larger interval spans (usually exons, introns etc.). Generally, we are tasked with finding which intervals in X overlap with which intervals in Y.

In the foverlaps function Y has to be indexed using the setkey function (we don't have to do it on X). The key is intended to speed-up finding overlaps.

Which one is faster?

To check this we used the benchmark function from the rbenchmark package. It's a simple wrapper of the system.time function.

The code below plots the execution time of both functions for increasing numbers of rows of data set X.

Interestingly,  foverlaps is the fastest way to solve the problem of finding overlaps, but only when the large data set has less than 200k rows.

We also plotted situation when we exchanged the place of X and Y in arguments of both functions. In this case you can see that almost from the beginning foverlaps is much slower than findOverlaps.

Information about my R session:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.4 (Mavericks)

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] data.table_1.9.4     rbenchmark_1.0.0     GenomicRanges_1.18.4
[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        S4Vectors_0.4.0     
[7] BiocGenerics_0.12.1 

loaded via a namespace (and not attached):
[1] chron_2.3-45   plyr_1.8.1     Rcpp_0.11.5    reshape2_1.4.1 stringr_0.6.2 
[6] tools_3.1.3    XVector_0.6.0

Tuesday, March 31, 2015

Using genomation to analyze methylation profiles from Roadmap epigenomics and ENCODE

The genomation package is a toolkit for annotation and visualization of various genomic data. The package is currently also in BioC. It allows to analyze high-throughput data, including bisulfite sequencing data. Here, we will visualize the distribution of CpG methylation around promoters and their locations within gene structures on human chromosome 3.

Heatmap and plot of meta-profiles of CpG methylation around promoters

In this example we use data from Reduced Representation Bisulfite Sequencing (RRBS) and
Whole-genome Bisulfite Sequencing (WGBS) techniques and H1 and IMR90 cell types
derived from the ENCODE and the Roadmap Epigenomics Project databases.

We download the datasets and convert them to GRanges objects. Using rtracklayer and genomation functions. We also use a refseq bed file for annotation and extraction of promoter regions using readTranscriptFeatures function.

Since we have read the files now we can build base-pair resolution matrices of scores(methylation values) for each experiment. The returned list of matrices can be used to draw heatmaps or meta profiles of methylation ratio around promoters.

Distribution of covered CpGs across gene regions

genomation facilitates visualization of given locations of features aggregated by  exons, introns, promoters and TSSs. To find the distribution of covered CpGs within these gene structures, we will use transcript features we previously obtained. Here is the breakdown of the code

  1. Count overlap statistics between our CpGs from WGBS and RRBS H1 cell type and gene structures
  2. Calculate percentage of CpGs overlapping with annotation
  3. plot them in a form of pie charts

Tuesday, February 18, 2014

summary, annotation and visualization of genomic intervals with genomation package

We have been working on a new R package, genomation, that will help with the analysis of genomic intervals. Briefly, the package contains a collection of tools for visualizing and analyzing genome-wide data sets. The package works with a variety of genomic interval file types (BAM, bigWig, BED, GFF and any tabular format containing genomic locations) and enables easy summarization and annotation of high throughput data sets with given genomic annotations. Below is a presentation I'm preparing for an upcoming internal meeting. You can find more information and the vignette on the webpage.

genomation is published in Bioinformatics
Akalin A, Franke V, Vlahoviček K, Mason CE, Schübeler D. genomation: a toolkit to summarize, annotate and visualize genomic intervals. Bioinformatics. 2014 Nov 21. pii: btu775

It is also currently in the Bioconductor development version