Working with large and/or many filesThis turns out to be more and more of a problem. When we first designed methylKit it was tested on RRBS data with couple of million CpGs covered. These days many people have either many RRBS or similar reduced representation sequencing results or a fair amount whole-genome bisulfite sequencing. Due to their size, it may be difficult to work with these kind of samples in-memory. We developed a flatfile database based on "tabix"(http://www.htslib.org/doc/tabix.html). This means large objects are written out as tabix files, they are compressed and indexed. We can work with them chunk-by-chunk in a parallelized manner. However, that comes with a cost. It will take time to initially create tabix files but you will be able to work with large data sets without so much of the memory bottleneck.
C/C++ based parsing of Bismark SAM/BAM filesWe used to parse sorted Bismark SAM files with a perl script. Now, that script is converted to C/C++ using samtools libraries and Rcpp in the background. This made parsing faster also got rid of a system dependency. processBismarkAln() function is the one where this functionality is implemented. It can parse sorted BAM/SAM files.
More differential methylation tests and dealing with covariatesWe added more differential methylation tests. Main improvement is that we can adjust for overdispersion within the logistic regression model. In addition, we can also add covariates such as age and sex into the model and take them into account when testing for differential methylation. We also added a beta-binomial distribution based test by wrapping the functions in the bioconductor DSS package.
Methylation profile segmentationRegulatory regions can be discovered by analyzing the methylation patterns, segments with low methylation could indicate a regulatory region. We used a segmentation strategy based on change-point analysis, where change points in a signal across the genome is extracted and the genome is segmented to regions between two change points These methods are typically used in CNV (copy-number variation) detection but could have applications in methylation context as well. In the context of methylation, we first extract segments between change points of the signal (methylation profiles) then cluster those segments into groups based on the mean methylation values of the segments. In this case, a segment is a region of the genome where CpGs have similar methylation values. We cluster those segments into segment groups using mixture modeling. This functionality is also mentioned here in this blog before, now it is a part of the current version.
Reading processed Bismark filesBismark aligner scripts can produce per base methylation files. These files, namely "cytosine report" and "coverage" files can be read in R using methylKit methRead() function. See Bismark manual for details on the files.
Function name changesThe following function names are changed for package wide naming consistency. The old names will continue to work for a while.
- read() to methRead()
- read.bismark to processBismarkAln()
- adjust.methylC() to adjustMethylC()
- get.methylDiff() to getMethylDiff()
- annotate.WithFeature() to annotateWithFeature()
- annotate.WithFeature.Flank() to annotateWithFeatureFlank()
- annotate.WithGenicParts() to annotateWithGenicParts()
- read.bed() to readBed()
- read.feature.flank() to readFeatureFlank()
- read.transcript.features() to readTranscriptFeatures()