Wednesday, February 10, 2021

New Book: "Computational Genomics with R"


It costs now 299$ now to sequence your genome []. It is getting cheaper to generate genomic data and more of it is coming through. We can't see the same trend for the people who can analyze the data. There is always a shortage of people who can do decent analysis of genomic data. We need to train these people. However, the skillset required ,apart from the domain specific knowledge of genomics, is very similar to data science jobs. Some qualified people will go on those route and some people who are trained in genomic data analysis will end up as data scientists in unrelated fields. This pushes the demand for genomic data analysis people higher and higher. Not only there is not enough of them, the ones we have might do something else using the same skillset.

Because of this scarcity but also to decrease our workload in the lab we chose to train people in computational genomics. We have done this for many years. We organized public hands-on courses and provide consultation and mentoring opportunities for people who want to learn how to do genomic data analysis. We see that there is a large demand for genomic data analysis and we can teach those skills.

Now, we organized this material as a book and published via CRC press.

Who is this book for?

The book contains practical and theoretical aspects of computational genomics. Biology and medicine generate more data than ever before. Therefore, we need to educate more people with data analysis skills and understanding of computational genomics. Since computational genomics is interdisciplinary, this book aims to be accessible for biologists, medical scientists, computer scientists and people from other quantitative backgrounds. We wrote this book for the following audiences:

  • Biologists and medical scientists who generate the data and are keen on analyzing it themselves.
  • Students and researchers who are formally starting to do research on or using computational genomics do not have extensive domain-specific knowledge, but have at least a beginner-level understanding in a quantitative field, for example, math, stats.
  • Experienced researchers looking for recipes or quick how-to’s to get started in specific data analysis tasks related to computational genomics.

What will you get out of this?

This resource describes the skills and provides how-to’s that will help readers analyze their own genomics data.

After reading:

  • If you are not familiar with R, you will get the basics of R and dive right in to specialized uses of R for computational genomics.
  • You will understand genomic intervals and operations on them, such as overlap.
  • You will be able to use R and its vast package library to do sequence analysis, such as calculating GC content for given segments of a genome or find transcription factor binding sites.
  • You will be familiar with visualization techniques used in genomics, such as heatmaps, meta-gene plots, and genomic track visualization.
  • You will be familiar with supervised and unsupervised learning techniques which are important in data modeling and exploratory analysis of high-dimensional data.
  • You will be familiar with analysis of different high-throughput sequencing datasets (RNA-seq, ChIP-seq, BS-seq and multi-omics integration) mostly using R-based tools.

Wednesday, January 17, 2018

Computational genomics course in Berlin: Bisulfite-sequencing and data integration

Bioinformatics platform at Berlin Institute for Medical Systems Biology has been organizing computational genomics courses in Berlin since 2015. This year we will focus on bisulfite sequencing analysis and genomics data integration.

The course will cover the following:
1) bisulfite-seq analysis (alignment, QC, differential methylation, segmentation)
2) integrating and visualizing processed genomics data. These methods can be used for any genomics data set not necessarily BS-seq data. 

There will be theoretical lectures followed by practical sessions where students directly apply what they have learned. The programming will be mainly done in R and unix shell.

The deadline for application is March 1st! 

You can get more info at

Monday, October 31, 2016

Poster/cheatsheet for R/BioC package genomation

We prepared a poster/cheatsheet for the bioconductor package genomation, which is a package for summary and annotation of genomic intervals. Users can visualize and quantify genomic intervals over pre-defined functional regions, such as promoters, exons, introns, etc. The genomic intervals represent regions with a defined chromosome position, which may be associated with a score, such as aligned reads from HT-seq experiments, TF binding sites, methylation scores, etc. The package can use any tabular genomic feature data as long as it has minimal information on the locations of genomic intervals. In addition, It can use BAM or BigWig files as input. [download from slideshare for a better resolution]

Friday, August 12, 2016

Annotating sets of genomic intervals with genomic annotations such as chromHMM

Genomation is an R package to summarize, annotate and visualize genomic intervals. It contains a collection of tools for visualizing and analyzing genome-wide data sets, i.e. RNA-seq, bisulfite sequencing or chromatin-immunoprecipitation followed by sequencing (ChIP-seq) data.

Recently we added new features to genomation The new functionalities are available in the latest version of genomation that can be found on its github website.

This demo shows the new annotation functions in genomation. The functions can be used to annotate target regions or a list of target regions with a given set of genomic features. The genomic features to be used should be in named GRangesList format.

Get data to R

We will get p300, SP1 and Nanog peaks,and chromHMM annotations from ENCODE. We wil use the H1-Esc cells. The aim is to annotate the peaks with chromHMM annotations.

The new functions

here are the new functions annotateWithFeatures and heatTargetAnnotation. These functions are available as of version 1.5.6. There were similar functions within the package before but these are more generalized versions of the old ones.

annotateWithFeatures() will calculate the percentage of overlaps in a given GRanges object with a GRangesList object where each element correspods to a different category of genomic features such as promoters, exons and introns.

heatTargetAnnotation() will plot a heatmap of annotations returned from annotateWithFeatures() or can return the matrix that is used the create the heatmap.

Annotate peaks with chromHMM segments

We can annote a given peak set like this:

## summary of target set annotation with feature annotation:
## Rows in target set: 8934
## ----------------------------
## percentage of target elements overlapping with features:
## 1_Active_Promoter 10_Txn_Elongation       11_Weak_Txn      12_Repressed 
##             28.16              0.07              2.10              1.33 
## 13_Heterochrom/lo 14_Repetitive/CNV 15_Repetitive/CNV   2_Weak_Promoter 
##              2.74              0.08              0.18             23.86 
## 3_Poised_Promoter 4_Strong_Enhancer 5_Strong_Enhancer   6_Weak_Enhancer 
##              8.69             11.91             16.33             32.65 
##   7_Weak_Enhancer       8_Insulator  9_Txn_Transition 
##              8.03              5.72              1.00
## percentage of feature elements overlapping with target:
## 1_Active_Promoter 10_Txn_Elongation       11_Weak_Txn      12_Repressed 
##             19.81              0.04              0.16              0.68 
## 13_Heterochrom/lo 14_Repetitive/CNV 15_Repetitive/CNV   2_Weak_Promoter 
##              0.27              0.19              0.57              6.76 
## 3_Poised_Promoter 4_Strong_Enhancer 5_Strong_Enhancer   6_Weak_Enhancer 
##              6.45             20.36             11.49              3.46 
##   7_Weak_Enhancer       8_Insulator  9_Txn_Transition 
##              0.53              0.84              0.71

We can annotate the GRangesList of different peak sets like this:

## Working on: p300
## Working on: SP1
## Working on: NANOG

Make a heatmap of percentage of peaks overlapping with different chromHMM annotations

We can make a heatmap using heatTargetAnnotation() function. The function also returns a ggplot2 object, so it can be further manipulated via ggplot2.

We can also get the matrix that is used to make the heatmap and use it in other heatmap functions.


## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.5 (El Capitan)
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## other attached packages:
## [1] devtools_1.11.1  gplots_3.0.1     genomation_1.5.6
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6                formatR_1.4               
##  [3] GenomeInfoDb_1.8.3         plyr_1.8.4                
##  [5] XVector_0.12.1             bitops_1.0-6              
##  [7] tools_3.3.0                zlibbioc_1.18.0           
##  [9] digest_0.6.10              memoise_1.0.0             
## [11] tibble_1.1                 evaluate_0.9              
## [13] gtable_0.2.0               BSgenome_1.40.1           
## [15] gridBase_0.4-7             yaml_2.1.13               
## [17] parallel_3.3.0             withr_1.0.1               
## [19] rtracklayer_1.32.2         stringr_1.0.0             
## [21] knitr_1.13.1               caTools_1.17.1            
## [23] gtools_3.5.0               Biostrings_2.40.2         
## [25] S4Vectors_0.10.2           IRanges_2.6.1             
## [27] stats4_3.3.0               Biobase_2.32.0            
## [29] data.table_1.9.6           impute_1.46.0             
## [31] plotrix_3.6-3              XML_3.98-1.4              
## [33] BiocParallel_1.6.5         seqPattern_1.4.0          
## [35] rmarkdown_0.9.6            gdata_2.17.0              
## [37] reshape2_1.4.1             readr_1.0.0               
## [39] ggplot2_2.1.0              magrittr_1.5              
## [41] codetools_0.2-14           matrixStats_0.50.2        
## [43] Rsamtools_1.24.0           scales_0.4.0              
## [45] htmltools_0.3.5            BiocGenerics_0.18.0       
## [47] GenomicRanges_1.24.2       GenomicAlignments_1.8.4   
## [49] assertthat_0.1             SummarizedExperiment_1.2.3
## [51] colorspace_1.2-6           labeling_0.3              
## [53] KernSmooth_2.23-15         stringi_1.1.1             
## [55] RCurl_1.95-4.8             munsell_0.4.3             
## [57] chron_2.3-47

Thursday, June 2, 2016

methylKit v0.9.6

We released a new version of methylKit, which is a package for DNA methylation analysis with bisulfite-seq data. This version comes with many changes summarized below. you can also have a look at the release notes. The vignette is now converted to HTML and also lives on rpubs.

Working with large and/or many files

This turns out to be more and more of a problem. When we first designed methylKit it was tested on RRBS data with couple of million CpGs covered. These days many people have either many RRBS or similar reduced representation sequencing results or a fair amount whole-genome bisulfite sequencing. Due to their size, it may be difficult to work with these kind of samples in-memory. We developed a flatfile database based on "tabix"( This means large objects are written out as tabix files, they are compressed and indexed. We can work with them chunk-by-chunk in a parallelized manner. However, that comes with a cost. It will take time to initially create tabix files but you will be able to work with large data sets without so much of the memory bottleneck.

C/C++ based parsing of Bismark SAM/BAM files

We used to parse sorted Bismark SAM files with a perl script. Now, that script is converted to C/C++ using samtools libraries and Rcpp in the background. This made parsing faster also got rid of a system dependency. processBismarkAln() function is the one where this functionality is implemented. It can parse sorted BAM/SAM files.

More differential methylation tests and dealing with covariates

We added more differential methylation tests. Main improvement is that we can adjust for overdispersion within the logistic regression model. In addition, we can also add covariates such as age and sex into the model and take them into account when testing for differential methylation. We also added a beta-binomial distribution based test by wrapping the functions in the bioconductor DSS package.

Methylation profile segmentation 

Regulatory regions can be discovered by analyzing the methylation patterns, segments with low methylation could indicate a regulatory region. We used a segmentation strategy based on change-point analysis, where change points in a signal across the genome is extracted and the genome is segmented to regions between two change points These methods are typically used in CNV (copy-number variation) detection but could have applications in methylation context as well.  In the context of methylation, we first extract segments between change points of the signal (methylation profiles) then cluster those segments into groups based on the mean methylation values of the segments. In this case, a segment is a region of the genome where CpGs have similar methylation values. We cluster those segments into segment groups using mixture modeling. This functionality is also mentioned here in this blog before, now it is a part of the current version.

Reading processed Bismark files 

Bismark aligner scripts can produce per base methylation files. These files, namely "cytosine report"  and "coverage" files can be read in R using methylKit methRead() function. See Bismark manual for details on the files.

Function name changes 

The following function names are changed for package wide naming consistency. The old names will continue to work for a while.
    - read() to methRead()
    - read.bismark to processBismarkAln()
    - adjust.methylC() to adjustMethylC()
    - get.methylDiff() to getMethylDiff()
    - annotate.WithFeature() to annotateWithFeature()
    - annotate.WithFeature.Flank() to annotateWithFeatureFlank()
    - annotate.WithGenicParts() to annotateWithGenicParts()
    - read.bed() to readBed()
    - read.feature.flank() to readFeatureFlank()
    - read.transcript.features() to readTranscriptFeatures()

Monday, May 23, 2016

Computational Genomics Course in Berlin

Berlin Institute for Medical Systems Biology is organizing a computational genomics course and R programming will be used for most of the practical sessions. The course will cover basic statistics, programming and basic concepts in next-generation sequencing as well as it is applications such as RNA-seq, ChIP-seq, DNA-seq and metagenomics in the context of precision medicine. There will be practical sessions every day. The course will last 10 work days and will be taught at MDC-Berlin campus between 12-23 September 2015.

Course Modules

  • Introduction to R and Bioconductor
  • Statistics and Exploratory Data analysis
  • Introduction to Next-gen sequencing
  • RNA-seq analysis
  • ChIP-seq analysis
  • Variant calling and annotation
  • Data integration and predictive modeling using HT-seq data
  • metagenomics and human health


The instructors include high-profile local and external scientists working on computational genomics.
See the full list at:

Travel Allowance 

We offer a travel allowance for covering the travel and the accommodation (for doublerooms in downtown Berlin) for successful candidates.

Apply By 01 July 2016

Event poster

Thursday, October 15, 2015

New features in genomation package

Extending genomation to work with paired-end BAM files

Genomation is an R package to summarize, annotate and visualize genomic intervals. It contains a collection of tools for visualizing and analyzing genome-wide data sets, i.e. RNA-seq, bisulfite sequencing or chromatin-immunoprecipitation followed by sequencing (ChIP-seq) data.

Recently we added new features to genomation and here we present them on example of binding profiles of 6 transcription factors around the CTCF binding sites derived from ChIP-seq. All new functionalities are available in the latest version of genomation that can be found on its github website.

# install the package from github

Extending genomation to work with paired-end BAM files

Genomation can work with paired-end BAM files. Mates from reads are treated as fragments (they are stitched together).

genomationDataPath = system.file('extdata',package='genomationData')
bam.files = list.files(genomationDataPath, full.names=TRUE, pattern='bam$')
bam.files = bam.files[!grepl('Cage', bam.files)]

Accelerate functions responsible for reading genomic files

This is achived by using readr::read_delim function to read genomic files instead of read.table. Additionally if skip=“auto” argument is provided in readGeneric_or track.line=“auto” in other functions that read genomic files, e.g. _readBroadPeak then UCSC header is detected (and first track).


ctcf.peaks = readBroadPeak(file.path(genomationDataPath, 
ctcf.peaks = ctcf.peaks[seqnames(ctcf.peaks) == 'chr21']
ctcf.peaks = ctcf.peaks[order(-ctcf.peaks$signalValue)]
ctcf.peaks = resize(ctcf.peaks, width=1000, fix='center')

Parallelizing data processing in ScoreMatrixList

We use ScoreMatrixList function to extract coverage values of all transcription factors around ChIP-seq peaks. ScoreMatrixList was improved by adding new argument cores that indicates number of cores to be used at the same time (by using parallel:mclapply).

sml = ScoreMatrixList(bam.files, ctcf.peaks, bin.num=50, type='bam', cores=2)

# descriptions of file that contain info. about transcription factors
sampleInfo = read.table(system.file('extdata/SamplesInfo.txt',
                                    package='genomationData'),header=TRUE, sep='\t')
names(sml) = sampleInfo$sampleName[match(names(sml),sampleInfo$fileName)]

Arithmetic, indicator and logic operations as well as subsetting work on score matrices

Arithmetic, indicator and logic operations work on ScoreMatrix, ScoreMatrixBin and ScoreMatrixList
objects, e.i.:
Arith: “+”, “-”, “*”, “”, “%%”, “%/%”, “/”
Compare: “==”, “>”, “<”, “!=”, “<=”, “>=”
Logic: “&”, “|”

sml1 = sml * 100
## scoreMatrixlist of length:5
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50


sml[[6]] = sml[[1]]
## scoreMatrixlist of length:6
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50
## 6. scoreMatrix with dims: 1681 50
sml[[6]] <- NULL

Improvements and new arguments in visualization functions

Due to large signal scale of rows of each element in the ScoreMatrixList we scale them.

sml.scaled = scaleScoreMatrixList(sml)

Faster heatmaps

HeatMatrix and multiHeatMatrix function works faster by faster assigning colors. Heatmap profile of scaled coverage shows a colocalization of Ctcf, Rad21 and Znf143.

multiHeatMatrix(sml.scaled, xcoords=c(-500, 500))

plot of chunk unnamed-chunk-8

New clustering possibilities in heatmaps: “clustfun” argument in multiHeatMatrix

clustfun allow to add more clustering functions and integrate them with the heatmap function multiHeatMatrix. It has to be a function that returns a vector of integers indicating the cluster to which each point is allocated. Previous version of multiHeatMatrix could cluster rows of heatmaps using only k-means algorithm.

# k-means algorithm with 2 clusters
cl1 <- function(x) kmeans(x, centers=2)$cluster
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1)

plot of chunk unnamed-chunk-9

# hierarchical clustering with Ward's method for agglomeration into 2 clusters
cl2 <- function(x) cutree(hclust(dist(x), method="ward"), k=2)
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl2)
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"

plot of chunk unnamed-chunk-9

Defining which matrices are used for clustering: “clust.matrix” in multiHeatMatrix

clust.matrix argument indicates which matrices are used for clustering. It can be a numerical vector of indexes of matrices or a character vector of names of the ‘ScoreMatrix’ objects in 'ScoreMatrixList'. Matrices that are not in clust.matrix are ordered according to the result of the clustering algorithm. By default all matrices are clustered.

multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1, clust.matrix = 1)

plot of chunk unnamed-chunk-10

Central tendencies in line plots: centralTend in plotMeta

We extended visualization capabilities for meta-plots. plotMeta function can plot not only mean, but also median as central tendency and it can be set up using centralTend argument. Previously user could plot only mean.

plotMeta(mat=sml.scaled, profile.names=names(sml.scaled),
         xcoords=c(-500, 500),

plot of chunk unnamed-chunk-11

Smoothing central tendency: smoothfun in plotMeta

We added smoothfun argument to smooth central tendency as well as dispersion bands around it which is shown in the next figure. Smoothfun has to be a function that returns a list that contains a vector of y coordinates (vector named '$y').

plotMeta(mat=sml.scaled, profile.names=names(sml.scaled),
         xcoords=c(-500, 500),
         smoothfun=function(x) stats::smooth.spline(x, spar=0.5))

plot of chunk unnamed-chunk-12

Plotting dispersion around central lines in line plots: dispersion in plotMeta

We added new argument dispersion to plotMeta that shows dispersion bands around centralTend. It can take one of the arguments:

  • “se” shows standard error of the mean and 95 percent confidence interval for the mean
  • “sd” shows standard deviation and 2*(standard deviation)
  • “IQR” shows 1st and 3rd quartile and confidence interval around the median based on the median +/- 1.57 * IQR/sqrt(n) (notches)
plotMeta(mat=sml, profile.names=names(sml),
         xcoords=c(-500, 500),
         smoothfun=function(x) stats::smooth.spline(x, spar=0.5),
         dispersion="se", lwd=4)

plot of chunk unnamed-chunk-13

Calculating scores that correspond to k-mer or PWM matrix occurence: patternMatrix function

We added new function patternMatrix that calculates k-mer and PWM occurrences over predefined equal width windows. If one pattern (character of length 1 or PWM matrix) is given then it returns ScoreMatrix, if more than one character ot list of PWM matrices then ScoreMatrixList. It finds either positions of pattern hits above a specified threshold and creates score matrix filled with 1 (presence of pattern) and 0 (its absence) or matrix with score themselves. windows can be a DNAStringList object or GRanges object (but then genome argument has to be provided, a BSgenome object).

#ctcf motif from the JASPAR database
ctcf.pfm = matrix(as.integer(c(87,167,281,56,8,744,40,107,851,5,333,54,12,56,104,372,82,117,402, 
rownames(ctcf.pfm) <- c("A","C","G","T")

prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25)
priorProbs = prior.params/sum(prior.params)
postProbs = t( t(ctcf.pfm + prior.params)/(colSums(ctcf.pfm)+sum(prior.params)) )
ctcf.pwm = Biostrings::unitScale(log2(postProbs/priorProbs))

hg19 = BSgenome.Hsapiens.UCSC.hg19

p = patternMatrix(pattern=ctcf.pwm, windows=ctcf.peaks, genome=hg19, min.score=0.8)

Visualization of the patternMatrix patternMatrix (here as ScoreMatrix object) can be visualized using i.e. heatMatrix, heatMeta or plotMeta functions.

heatMatrix(p, xcoords=c(-500, 500), main="CTCF motif")

plot of chunk unnamed-chunk-15

plotMeta(mat=p, xcoords=c(-500, 500), smoothfun=function(x) stats::lowess(x, f = 1/10), 
         line.col="red", main="ctcf motif")

plot of chunk unnamed-chunk-15

Integration with Travis CI for auto-testing

Recently we integrated genomation with Travis CI. It allows users to see current status of the package which is updated during every change of the package. Travis automatically runs R CMD CHECK and reports it. Shields shown below are on the genomation github site:
Status Build Status BioC_years BioC_availability

# <br />
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.36.3                  
##  [3] rtracklayer_1.28.10               Biostrings_2.36.4                
##  [5] XVector_0.8.0                     GenomicRanges_1.20.8             
##  [7] GenomeInfoDb_1.4.3                IRanges_2.2.9                    
##  [9] S4Vectors_0.6.6                   BiocGenerics_0.14.0              
## [11] genomation_1.1.27                 BiocInstaller_1.18.5             
## [13] devtools_1.9.1                   
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1             formatR_1.2.1          
##  [3] futile.logger_1.4.1     plyr_1.8.3             
##  [5] bitops_1.0-6            futile.options_1.0.0   
##  [7] tools_3.2.2             zlibbioc_1.14.0        
##  [9] digest_0.6.8            gridBase_0.4-7         
## [11] evaluate_0.8            memoise_0.2.1          
## [13] gtable_0.1.2            curl_0.9.3             
## [15] yaml_2.1.13             proto_0.3-10           
## [17] httr_1.0.0              stringr_1.0.0          
## [19] knitr_1.11              data.table_1.9.6       
## [21] impute_1.42.0           R6_2.1.1               
## [23] plotrix_3.5-12          XML_3.98-1.3           
## [25] BiocParallel_1.2.22     seqPattern_1.0.1       
## [27] rmarkdown_0.8.1         readr_0.1.1            
## [29] reshape2_1.4.1          ggplot2_1.0.1          
## [31] lambda.r_1.1.7          magrittr_1.5           
## [33] matrixStats_0.14.2      MASS_7.3-44            
## [35] scales_0.3.0            Rsamtools_1.20.5       
## [37] htmltools_0.2.6         GenomicAlignments_1.4.2
## [39] colorspace_1.2-6        KernSmooth_2.23-15     
## [41] stringi_0.5-5           munsell_0.4.2          
## [43] RCurl_1.95-4.7          chron_2.3-47           
## [45] markdown_0.7.7