Recipes, scripts and genomics: sequencing

Sunday, July 5, 2015

Hands-on computational genomics course in Berlin

The Berlin Institute for Medical Systems Biology is organizing a computational genomics course, and R will be used for most practical sessions. The course will cover basic statistics, programming and basic concepts in next-generation sequencing, as well as its applications such as RNA-seq and ChIP-seq. There will be practical sessions every day. The course will last 7 days and will be taught at the MDC-Berlin campus from 18 to 24 October 2015.

Course Modules

The course modules are as follows. Almost all modules have practical sessions, so participants will have a chance to directly apply what they have learned.
- Introduction to R and Bioconductor
- Statistics and Exploratory Data analysis
- Introduction to Next-gen sequencing
- Applications of computational genomics
- RNA-seq analysis
- ChIP-seq analysis
- Data integration and visualization

The instructors

The instructors include high-profile local and external scientists working on computational genomics.
See the full list at: http://compgen2015.mdc-berlin.de/

Apply by 30 July 2015

See instructions at: http://compgen2015.mdc-berlin.de/

Event poster

Computational genomics course poster 2015 (BIMSB/MDC-Berlin) from Altuna Akalin

Wednesday, February 1, 2012

samtools in parallel

GNU parallel is a great tool for parallelizing your samtools jobs and making things considerably faster. GNU parallel is also available through MacPorts (useful if you are on OS X).

Here is an example of how to use it. Consider a case where you have multiple SAM files in a folder and they need to be converted to BAM, sorted and indexed.

Here is a slow way to achieve it via a bash script. The script loops over the SAM files in the directory and processes them one at a time.

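 for sample in *.sam
 do
   echo "$sample"
   # Strip the trailing .sam extension to get the output prefix
   describer=$(echo "${sample}" | sed 's/\.sam$//')
   echo "$describer"

   # Convert file from SAM to BAM format (-S: input is SAM; needed by older samtools)
   samtools view -b -S "$sample" > "${describer}.uns.bam"

   # Sort BAM file (legacy "in.bam out_prefix" syntax; samtools 1.3 and later use -o out.bam)
   samtools sort "${describer}.uns.bam" "${describer}"

   # Index the BAM file
   samtools index "${describer}.bam"

   # Remove intermediate files
   rm "${describer}.uns.bam"
 done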

Here is how you can do it with GNU parallel, using 4 cores:

 
ls *.sam | parallel -j4 -k bash convert2bam.sh {}

The contents of convert2bam.sh:

  
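# The SAM file to convert is passed as the first argument
sample=$1
# Strip the trailing .sam extension to get the output prefix
describer=$(echo "${sample}" | sed 's/\.sam$//')

# Convert file from SAM to BAM format (-S: input is SAM; needed by older samtools)
samtools view -b -S "$sample" > "${describer}.uns.bam"

# Sort BAM file (legacy "in.bam out_prefix" syntax; samtools 1.3 and later use -o out.bam)
samtools sort "${describer}.uns.bam" "${describer}"

# Index the BAM file
samtools index "${describer}.bam"

# Remove intermediate files
rm "${describer}.uns.bam"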

EDIT: And here is how you can achieve the same result via piping, without creating the intermediate file (thanks to the commenters):

  

 ls *.sam | parallel "samtools view -b -S {} | samtools sort - {.}; samtools index {.}.bam"
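
Both the scripts above and the `{.}` replacement string in the one-liner do the same job: they strip the `.sam` extension to get an output prefix. As a minimal sketch (the `strip_sam` helper and the file names are made up for illustration), the same thing can be done in plain shell with parameter expansion, which only removes a trailing `.sam` and does not need sed:

```shell
# strip_sam is a hypothetical helper, not part of samtools or GNU parallel.
# "${1%.sam}" removes a trailing ".sam" from the first argument, if present.
strip_sam() {
  printf '%s\n' "${1%.sam}"
}

strip_sam sample1.sam       # prints: sample1
strip_sam assam_1.sam       # prints: assam_1
```

This matters for unusual file names: an unanchored pattern such as sed 's/.sam//' would turn assam_1.sam into a_1.sam, because the dot matches any character and the first match wins.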

Tuesday, December 20, 2011

methylKit: R package for DNA methylation analysis

High-throughput bisulfite sequencing based methods are popular for measuring genome-wide DNA methylation levels. Here is an R package that helps with the analysis of such DNA methylation data. Although it is still under heavy development, the current functionality allows users to do many of the essential data analysis tasks. The package is primarily designed for Reduced Representation Bisulfite Sequencing (RRBS); however, it can also handle whole-genome bisulfite sequencing if the proper input format is provided.

Functionality includes:

- Read coverage statistics
- Methylation statistics
- Sample correlation and clustering
- Differential methylation analysis
- Feature annotation and accessor/coercion functions
- Multiple visualization options
- Regional and tiling windows analysis
- (Almost) proper documentation

The package is here:
http://code.google.com/p/methylkit/

Sunday, November 21, 2010

Short-read assemblers: trans-ABySS, cufflinks and a bit of CLC assembler

Recently, I had some experience with short-read assembly programs. I liked the trans-ABySS idea of merging de novo assemblies produced with different k-mer lengths, and it seems to work better than using only one k-mer length. So the pipeline is roughly as follows:

1) Produce many assemblies with ABySS using different k-mer lengths (do it for all k-mer lengths between read_length/2 and read_length).
2) Merge those assemblies with trans-ABySS.
3) Filter the merged contigs, map them to the genome and filter again.
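
Step 1 can be sketched as a small shell loop. This is only a dry run under stated assumptions: the `kmer_sweep` helper, the read file names and the `k<N>` assembly names are made up for illustration, and the loop prints `abyss-pe` command lines (using its `k=`, `name=` and `in=` parameters) instead of running them. Newer ABySS releases may need additional parameters, so check the documentation of your version before running anything.

```shell
# kmer_sweep is a hypothetical helper: it prints one abyss-pe command per
# k-mer length from read_length/2 (rounded down) up to read_length.
# Nothing is executed; the commands are only echoed for inspection.
kmer_sweep() (
  len=$1      # read length
  reads=$2    # space-separated read files (placeholder names)
  k=$((len / 2))
  while [ "$k" -le "$len" ]; do
    echo "abyss-pe k=$k name=k$k in='$reads'"
    k=$((k + 1))
  done
)

# Example: 50 bp reads give k = 25 ... 50, i.e. 26 assemblies.
kmer_sweep 50 "reads_1.fastq reads_2.fastq"
```

The printed lines can be reviewed and then fed to a job scheduler or to GNU parallel, which makes it obvious how quickly the number of assemblies grows with read length.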

Potentially, you can use parts of the trans-ABySS pipeline to merge assemblies from other de novo assemblers such as Oases and the CLC assembler. The details on trans-ABySS can be found in their hidden manual, but ABySS itself doesn't have a very good manual. They do have a mailing list where you can ask questions, but I would still prefer a proper manual. The problem with ABySS and trans-ABySS is that they have a lot of dependencies, so you need patience when installing.

Furthermore, it will take a lot of time to produce that many assemblies and merge them, especially if you don't have easy access to a computer cluster. Having said that, tophat + cufflinks are not extremely fast either. Mapping reads to the genome with tophat and creating transcripts with cufflinks takes 1-2 days on 6 processors on a Mac Pro for ~80M paired-end and ~80M single-end reads. I like tophat + cufflinks because they have very good manuals and pre-compiled binaries.

I also tried the trial version of the CLC assembler, which is a commercial product. First, I didn't like that it wasn't free. Second, I didn't like that you can't choose a k-mer size longer than 31 base pairs. But compared to ABySS it was very fast. It took maybe less than an hour to complete 99% of the assembly, but then it crashed with a "segmentation fault". It is promising, but it is expensive and it didn't exactly work for me. It would be better if at least the command-line version were free. People who are comfortable with the command line and don't have access to a cluster can probably rent time on the Amazon cloud and assemble their sequences using ABySS, and spend much less money that way. I heard CLC is around $50k (I don't know how reliable that information is, though).

There are other assemblers out there as well, which I haven't had time to try yet:
Oases + velvet: I heard good ("worked well for me") and bad things ("heavy on memory" and "was very slow") about it.
Scripture: Initially I heard good things about it, but it has worse accuracy according to the trans-ABySS paper.

Friday, November 19, 2010

Command-line preprocessing of short-read data

The FASTX toolkit provides command-line tools for processing FASTA and FASTQ files. For example, you can filter sequences based on their quality, trim low-quality bases at the ends of reads and clip adapters, in addition to many other functions.

Check out: http://hannonlab.cshl.edu/fastx_toolkit/
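
The three operations mentioned above (adapter clipping, quality trimming and quality filtering) could be chained roughly as follows. This is a sketch, not a recommended protocol: the `run` and `preprocess` helpers, the file names, the thresholds and the example adapter sequence are all assumptions for illustration, and the script only prints the commands unless DRY_RUN=0 is set. The -Q33 option is assumed to be needed for Phred+33 encoded FASTQ files; check the help text of your FASTX version.

```shell
# Dry-run helper (hypothetical): print the command unless DRY_RUN=0,
# in which case the command is executed.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# preprocess is a hypothetical wrapper around three FASTX tools.
# $1: input FASTQ file, $2: adapter sequence (example default, replace with yours)
preprocess() (
  in=$1
  prefix=${in%.fastq}
  adapter=${2:-AGATCGGAAGAGC}
  # Clip the adapter and drop reads shorter than 25 nt afterwards
  run fastx_clipper -Q33 -a "$adapter" -l 25 -i "$in" -o "$prefix.clipped.fastq"
  # Trim bases below quality 20 from the read end, keep reads of at least 25 nt
  run fastq_quality_trimmer -Q33 -t 20 -l 25 -i "$prefix.clipped.fastq" -o "$prefix.trimmed.fastq"
  # Keep reads in which at least 80% of the bases have quality 20 or more
  run fastq_quality_filter -Q33 -q 20 -p 80 -i "$prefix.trimmed.fastq" -o "$prefix.filtered.fastq"
)

preprocess reads.fastq
```

Printing the commands first is a cheap way to check file names and thresholds before committing hours of compute to a large FASTQ file.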
