Monday, October 31, 2016
Poster/cheatsheet for R/BioC package genomation
Friday, August 12, 2016
Annotating sets of genomic intervals with genomic annotations such as chromHMM
Annotating sets of genomic intervals with genomic annotations such as chromHMM
Genomation is an R package to summarize, annotate and visualize genomic intervals. It contains a collection of tools for visualizing and analyzing genome-wide data sets, i.e. RNA-seq, bisulfite sequencing or chromatin-immunoprecipitation followed by sequencing (ChIP-seq) data.
Recently we added new features to genomation The new functionalities are available in the latest version of genomation that can be found on its github website.
This demo shows the new annotation functions in genomation. The functions can be used to annotate target regions or a list of target regions with a given set of genomic features. The genomic features to be used should be in named GRangesList format.
Get data to R
We will get p300, SP1 and Nanog peaks,and chromHMM annotations from ENCODE. We wil use the H1-Esc cells. The aim is to annotate the peaks with chromHMM annotations.
The new functions
here are the new functions annotateWithFeatures and heatTargetAnnotation. These functions are available as of version 1.5.6. There were similar functions within the package before but these are more generalized versions of the old ones.
annotateWithFeatures() will calculate the percentage of overlaps in a given GRanges object with a GRangesList object where each element correspods to a different category of genomic features such as promoters, exons and introns.
heatTargetAnnotation() will plot a heatmap of annotations returned from annotateWithFeatures() or can return the matrix that is used the create the heatmap.
Annotate peaks with chromHMM segments
We can annote a given peak set like this:
## summary of target set annotation with feature annotation:
## Rows in target set: 8934
## ----------------------------
## percentage of target elements overlapping with features:
## 1_Active_Promoter 10_Txn_Elongation 11_Weak_Txn 12_Repressed
## 28.16 0.07 2.10 1.33
## 13_Heterochrom/lo 14_Repetitive/CNV 15_Repetitive/CNV 2_Weak_Promoter
## 2.74 0.08 0.18 23.86
## 3_Poised_Promoter 4_Strong_Enhancer 5_Strong_Enhancer 6_Weak_Enhancer
## 8.69 11.91 16.33 32.65
## 7_Weak_Enhancer 8_Insulator 9_Txn_Transition
## 8.03 5.72 1.00
##
## percentage of feature elements overlapping with target:
## 1_Active_Promoter 10_Txn_Elongation 11_Weak_Txn 12_Repressed
## 19.81 0.04 0.16 0.68
## 13_Heterochrom/lo 14_Repetitive/CNV 15_Repetitive/CNV 2_Weak_Promoter
## 0.27 0.19 0.57 6.76
## 3_Poised_Promoter 4_Strong_Enhancer 5_Strong_Enhancer 6_Weak_Enhancer
## 6.45 20.36 11.49 3.46
## 7_Weak_Enhancer 8_Insulator 9_Txn_Transition
## 0.53 0.84 0.71
##
We can annotate the GRangesList of different peak sets like this:
## Working on: p300
## Working on: SP1
## Working on: NANOG
Make a heatmap of percentage of peaks overlapping with different chromHMM annotations
We can make a heatmap using heatTargetAnnotation() function. The function also returns a ggplot2 object, so it can be further manipulated via ggplot2.

We can also get the matrix that is used to make the heatmap and use it in other heatmap functions.

sessionInfo()
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.5 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] devtools_1.11.1 gplots_3.0.1 genomation_1.5.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 formatR_1.4
## [3] GenomeInfoDb_1.8.3 plyr_1.8.4
## [5] XVector_0.12.1 bitops_1.0-6
## [7] tools_3.3.0 zlibbioc_1.18.0
## [9] digest_0.6.10 memoise_1.0.0
## [11] tibble_1.1 evaluate_0.9
## [13] gtable_0.2.0 BSgenome_1.40.1
## [15] gridBase_0.4-7 yaml_2.1.13
## [17] parallel_3.3.0 withr_1.0.1
## [19] rtracklayer_1.32.2 stringr_1.0.0
## [21] knitr_1.13.1 caTools_1.17.1
## [23] gtools_3.5.0 Biostrings_2.40.2
## [25] S4Vectors_0.10.2 IRanges_2.6.1
## [27] stats4_3.3.0 Biobase_2.32.0
## [29] data.table_1.9.6 impute_1.46.0
## [31] plotrix_3.6-3 XML_3.98-1.4
## [33] BiocParallel_1.6.5 seqPattern_1.4.0
## [35] rmarkdown_0.9.6 gdata_2.17.0
## [37] reshape2_1.4.1 readr_1.0.0
## [39] ggplot2_2.1.0 magrittr_1.5
## [41] codetools_0.2-14 matrixStats_0.50.2
## [43] Rsamtools_1.24.0 scales_0.4.0
## [45] htmltools_0.3.5 BiocGenerics_0.18.0
## [47] GenomicRanges_1.24.2 GenomicAlignments_1.8.4
## [49] assertthat_0.1 SummarizedExperiment_1.2.3
## [51] colorspace_1.2-6 labeling_0.3
## [53] KernSmooth_2.23-15 stringi_1.1.1
## [55] RCurl_1.95-4.8 munsell_0.4.3
## [57] chron_2.3-47
Thursday, October 15, 2015
New features in genomation package
- Extending genomation to work with paired-end BAM files
- Accelerate functions responsible for reading genomic files
- Parallelizing data processing in ScoreMatrixList
- Arithmetic, indicator and logic operations as well as subsetting work on ScoreMatrix objects
-
Improvements and new arguments in visualization functions
- Faster heatmaps
- New clustering possibilities in heatmaps: “clustfun” argument in multiHeatMatrix
- Defining which matrices are used for clustering: “clust.matrix” in multiHeatMatrix
- Central tendencies in line plots: centralTend in plotMeta
- Smoothing central tendency: smoothfun in plotMeta
- Plotting dispersion around central lines in line plots: dispersion in plotMeta
- Calculating scores that correspond to k-mer or PWM matrix occurence: patternMatrix function
- Integration with Travis CI for auto-testing
Genomation is an R package to summarize, annotate and visualize genomic intervals. It contains a collection of tools for visualizing and analyzing genome-wide data sets, i.e. RNA-seq, bisulfite sequencing or chromatin-immunoprecipitation followed by sequencing (ChIP-seq) data.
Recently we added new features to genomation and here we present them on example of binding profiles of 6 transcription factors around the CTCF binding sites derived from ChIP-seq. All new functionalities are available in the latest version of genomation that can be found on its github website.
# install the package from github
library(devtools)
install_github("BIMSBbioinfo/genomation",build_vignettes=FALSE)
Extending genomation to work with paired-end BAM files
Genomation can work with paired-end BAM files. Mates from reads are treated as fragments (they are stitched together).
library(genomation)
genomationDataPath = system.file('extdata',package='genomationData')
bam.files = list.files(genomationDataPath, full.names=TRUE, pattern='bam$')
bam.files = bam.files[!grepl('Cage', bam.files)]
Accelerate functions responsible for reading genomic files
This is achived by using readr::read_delim function to read genomic files instead of read.table. Additionally if skip=“auto” argument is provided in readGeneric_or track.line=“auto” in other functions that read genomic files, e.g. _readBroadPeak then UCSC header is detected (and first track).
library(GenomicRanges)
ctcf.peaks = readBroadPeak(file.path(genomationDataPath,
'wgEncodeBroadHistoneH1hescCtcfStdPk.broadPeak.gz'))
ctcf.peaks = ctcf.peaks[seqnames(ctcf.peaks) == 'chr21']
ctcf.peaks = ctcf.peaks[order(-ctcf.peaks$signalValue)]
ctcf.peaks = resize(ctcf.peaks, width=1000, fix='center')
Parallelizing data processing in ScoreMatrixList
We use ScoreMatrixList function to extract coverage values of all transcription factors around ChIP-seq peaks. ScoreMatrixList was improved by adding new argument cores that indicates number of cores to be used at the same time (by using parallel:mclapply).
sml = ScoreMatrixList(bam.files, ctcf.peaks, bin.num=50, type='bam', cores=2)
# descriptions of file that contain info. about transcription factors
sampleInfo = read.table(system.file('extdata/SamplesInfo.txt',
package='genomationData'),header=TRUE, sep='\t')
names(sml) = sampleInfo$sampleName[match(names(sml),sampleInfo$fileName)]
Arithmetic, indicator and logic operations as well as subsetting work on score matrices
Arithmetic, indicator and logic operations work on ScoreMatrix, ScoreMatrixBin and ScoreMatrixList
objects, e.i.:
Arith: “+”, “-”, “*”, “”, “%%”, “%/%”, “/”
Compare: “==”, “>”, “<”, “!=”, “<=”, “>=”
Logic: “&”, “|”
sml1 = sml * 100
sml1
## scoreMatrixlist of length:5
##
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50
Subsetting:
sml[[6]] = sml[[1]]
sml
## scoreMatrixlist of length:6
##
## 1. scoreMatrix with dims: 1681 50
## 2. scoreMatrix with dims: 1681 50
## 3. scoreMatrix with dims: 1681 50
## 4. scoreMatrix with dims: 1681 50
## 5. scoreMatrix with dims: 1681 50
## 6. scoreMatrix with dims: 1681 50
sml[[6]] <- NULL
Improvements and new arguments in visualization functions
Due to large signal scale of rows of each element in the ScoreMatrixList we scale them.
sml.scaled = scaleScoreMatrixList(sml)
Faster heatmaps
HeatMatrix and multiHeatMatrix function works faster by faster assigning colors. Heatmap profile of scaled coverage shows a colocalization of Ctcf, Rad21 and Znf143.
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500))
New clustering possibilities in heatmaps: “clustfun” argument in multiHeatMatrix
clustfun allow to add more clustering functions and integrate them with the heatmap function multiHeatMatrix. It has to be a function that returns a vector of integers indicating the cluster to which each point is allocated. Previous version of multiHeatMatrix could cluster rows of heatmaps using only k-means algorithm.
# k-means algorithm with 2 clusters
cl1 <- function(x) kmeans(x, centers=2)$cluster
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1)
# hierarchical clustering with Ward's method for agglomeration into 2 clusters
cl2 <- function(x) cutree(hclust(dist(x), method="ward"), k=2)
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl2)
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
Defining which matrices are used for clustering: “clust.matrix” in multiHeatMatrix
clust.matrix argument indicates which matrices are used for clustering. It can be a numerical vector of indexes of matrices or a character vector of names of the ‘ScoreMatrix’ objects in 'ScoreMatrixList'. Matrices that are not in clust.matrix are ordered according to the result of the clustering algorithm. By default all matrices are clustered.
multiHeatMatrix(sml.scaled, xcoords=c(-500, 500), clustfun = cl1, clust.matrix = 1)
Central tendencies in line plots: centralTend in plotMeta
We extended visualization capabilities for meta-plots. plotMeta function can plot not only mean, but also median as central tendency and it can be set up using centralTend argument. Previously user could plot only mean.
plotMeta(mat=sml.scaled, profile.names=names(sml.scaled),
xcoords=c(-500, 500),
winsorize=c(0,99),
centralTend="mean")
Smoothing central tendency: smoothfun in plotMeta
We added smoothfun argument to smooth central tendency as well as dispersion bands around it which is shown in the next figure. Smoothfun has to be a function that returns a list that contains a vector of y coordinates (vector named '$y').
plotMeta(mat=sml.scaled, profile.names=names(sml.scaled),
xcoords=c(-500, 500),
winsorize=c(0,99),
centralTend="mean",
smoothfun=function(x) stats::smooth.spline(x, spar=0.5))
Plotting dispersion around central lines in line plots: dispersion in plotMeta
We added new argument dispersion to plotMeta that shows dispersion bands around centralTend. It can take one of the arguments:
- “se” shows standard error of the mean and 95 percent confidence interval for the mean
- “sd” shows standard deviation and 2*(standard deviation)
- “IQR” shows 1st and 3rd quartile and confidence interval around the median based on the median +/- 1.57 * IQR/sqrt(n) (notches)
plotMeta(mat=sml, profile.names=names(sml),
xcoords=c(-500, 500),
winsorize=c(0,99),
centralTend="mean",
smoothfun=function(x) stats::smooth.spline(x, spar=0.5),
dispersion="se", lwd=4)
Calculating scores that correspond to k-mer or PWM matrix occurence: patternMatrix function
We added new function patternMatrix that calculates k-mer and PWM occurrences over predefined equal width windows. If one pattern (character of length 1 or PWM matrix) is given then it returns ScoreMatrix, if more than one character ot list of PWM matrices then ScoreMatrixList. It finds either positions of pattern hits above a specified threshold and creates score matrix filled with 1 (presence of pattern) and 0 (its absence) or matrix with score themselves. windows can be a DNAStringList object or GRanges object (but then genome argument has to be provided, a BSgenome object).
#ctcf motif from the JASPAR database
ctcf.pfm = matrix(as.integer(c(87,167,281,56,8,744,40,107,851,5,333,54,12,56,104,372,82,117,402,
291,145,49,800,903,13,528,433,11,0,3,12,0,8,733,13,482,322,181,
76,414,449,21,0,65,334,48,32,903,566,504,890,775,5,507,307,73,266,
459,187,134,36,2,91,11,324,18,3,9,341,8,71,67,17,37,396,59)),
ncol=19,byrow=TRUE)
rownames(ctcf.pfm) <- c("A","C","G","T")
prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25)
priorProbs = prior.params/sum(prior.params)
postProbs = t( t(ctcf.pfm + prior.params)/(colSums(ctcf.pfm)+sum(prior.params)) )
ctcf.pwm = Biostrings::unitScale(log2(postProbs/priorProbs))
library(BSgenome.Hsapiens.UCSC.hg19)
hg19 = BSgenome.Hsapiens.UCSC.hg19
p = patternMatrix(pattern=ctcf.pwm, windows=ctcf.peaks, genome=hg19, min.score=0.8)
Visualization of the patternMatrix patternMatrix (here as ScoreMatrix object) can be visualized using i.e. heatMatrix, heatMeta or plotMeta functions.
heatMatrix(p, xcoords=c(-500, 500), main="CTCF motif")
plotMeta(mat=p, xcoords=c(-500, 500), smoothfun=function(x) stats::lowess(x, f = 1/10),
line.col="red", main="ctcf motif")
Integration with Travis CI for auto-testing
Recently we integrated genomation with Travis CI. It allows users to see current status
of the package which is updated during every change of the package. Travis
automatically runs R CMD CHECK and reports it. Shields shown below are on the genomation github site:
https://github.com/BIMSBbioinfo/genomation
Status
# <br />
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel grid stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.36.3
## [3] rtracklayer_1.28.10 Biostrings_2.36.4
## [5] XVector_0.8.0 GenomicRanges_1.20.8
## [7] GenomeInfoDb_1.4.3 IRanges_2.2.9
## [9] S4Vectors_0.6.6 BiocGenerics_0.14.0
## [11] genomation_1.1.27 BiocInstaller_1.18.5
## [13] devtools_1.9.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 formatR_1.2.1
## [3] futile.logger_1.4.1 plyr_1.8.3
## [5] bitops_1.0-6 futile.options_1.0.0
## [7] tools_3.2.2 zlibbioc_1.14.0
## [9] digest_0.6.8 gridBase_0.4-7
## [11] evaluate_0.8 memoise_0.2.1
## [13] gtable_0.1.2 curl_0.9.3
## [15] yaml_2.1.13 proto_0.3-10
## [17] httr_1.0.0 stringr_1.0.0
## [19] knitr_1.11 data.table_1.9.6
## [21] impute_1.42.0 R6_2.1.1
## [23] plotrix_3.5-12 XML_3.98-1.3
## [25] BiocParallel_1.2.22 seqPattern_1.0.1
## [27] rmarkdown_0.8.1 readr_0.1.1
## [29] reshape2_1.4.1 ggplot2_1.0.1
## [31] lambda.r_1.1.7 magrittr_1.5
## [33] matrixStats_0.14.2 MASS_7.3-44
## [35] scales_0.3.0 Rsamtools_1.20.5
## [37] htmltools_0.2.6 GenomicAlignments_1.4.2
## [39] colorspace_1.2-6 KernSmooth_2.23-15
## [41] stringi_0.5-5 munsell_0.4.2
## [43] RCurl_1.95-4.7 chron_2.3-47
## [45] markdown_0.7.7
Tuesday, March 31, 2015
Using genomation to analyze methylation profiles from Roadmap epigenomics and ENCODE
The genomation package is a toolkit for annotation and visualization of various genomic data. The package is currently also in BioC. It allows to analyze high-throughput data, including bisulfite sequencing data. Here, we will visualize the distribution of CpG methylation around promoters and their locations within gene structures on human chromosome 3.
Heatmap and plot of meta-profiles of CpG methylation around promoters
In this example we use data from Reduced Representation Bisulfite Sequencing (RRBS) andWhole-genome Bisulfite Sequencing (WGBS) techniques and H1 and IMR90 cell types
derived from the ENCODE and the Roadmap Epigenomics Project databases.
We download the datasets and convert them to GRanges objects. Using rtracklayer and genomation functions. We also use a refseq bed file for annotation and extraction of promoter regions using readTranscriptFeatures function.
Since we have read the files now we can build base-pair resolution matrices of scores(methylation values) for each experiment. The returned list of matrices can be used to draw heatmaps or meta profiles of methylation ratio around promoters.
Distribution of covered CpGs across gene regions
- Count overlap statistics between our CpGs from WGBS and RRBS H1 cell type and gene structures
- Calculate percentage of CpGs overlapping with annotation
- plot them in a form of pie charts
Tuesday, February 18, 2014
summary, annotation and visualization of genomic intervals with genomation package
UPDATE:
genomation is published in Bioinformatics
Akalin A, Franke V, Vlahoviček K, Mason CE, Schübeler D. genomation: a toolkit to summarize, annotate and visualize genomic intervals. Bioinformatics. 2014 Nov 21. pii: btu775
It is also currently in the Bioconductor development version
http://www.bioconductor.org/packages/devel/bioc/html/genomation.html


