Wednesday, February 1, 2012

samtools in parallel

GNU parallel is a great tool for running your samtools jobs in parallel, which can make things considerably faster on a multi-core machine. It is also available through MacPorts (useful if you are on OS X).

Here is an example of how to use it. Consider a case where you have multiple SAM files in a folder and each one needs to be converted to BAM, sorted, and indexed.

Here is a slow way to achieve it with a bash script. The script loops over the SAM files in the directory, one at a time.
 for sample in *.sam
 do
   echo "$sample"
   # Strip the trailing .sam extension to get the output prefix
   describer="${sample%.sam}"
   echo "$describer"

   # Convert file from SAM to BAM format (-S: input is SAM; needed by samtools 0.1.x)
   samtools view -b -S "$sample" > "${describer}.uns.bam"

   # Sort BAM file (samtools 0.1.x syntax: the second argument is the output prefix)
   samtools sort "${describer}.uns.bam" "$describer"

   # Index the BAM file
   samtools index "${describer}.bam"

   # Remove intermediate files
   rm "${describer}.uns.bam"
 done
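
Deriving the output prefix from the file name is the step most likely to go wrong silently. As a minimal sketch (the file name subsample.sam is hypothetical, picked because it contains "sam" before the extension), the following compares an unanchored sed substitution, s/.sam//, with an anchored one and with the shell's own suffix removal:

```shell
#!/usr/bin/env bash
# Hypothetical input name that happens to contain "sam" before the extension.
sample="subsample.sam"

# Unanchored, unescaped pattern: "." matches any character, so sed removes the
# first "<any char>sam" it finds, which here is "bsam" in the middle of the name.
with_sed=$(echo "$sample" | sed 's/.sam//')

# Anchored alternative: escape the dot and pin the match to the end of the name.
with_anchored_sed=$(echo "$sample" | sed 's/\.sam$//')

# Parameter expansion: remove a trailing ".sam" without spawning a process.
with_expansion="${sample%.sam}"

printf 'unanchored sed: %s\n' "$with_sed"
printf 'anchored sed:   %s\n' "$with_anchored_sed"
printf 'expansion:      %s\n' "$with_expansion"
```

For this name the unanchored pattern yields suple.sam, while the other two forms yield subsample. For names like sample1.sam all three agree, which is why the problem is easy to miss.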


Here is how you can do it with GNU parallel on 4 cores (-j4 runs four jobs at a time, -k keeps the output in the same order as the input):

 
ls *.sam | parallel -j4 -k bash convert2bam.sh {}


The contents of convert2bam.sh:

  
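sample=$1
# Strip the trailing .sam extension to get the output prefix
describer="${sample%.sam}"

# Convert file from SAM to BAM format (-S: input is SAM; needed by samtools 0.1.x)
samtools view -b -S "$sample" > "${describer}.uns.bam"

# Sort BAM file (samtools 0.1.x syntax: the second argument is the output prefix)
samtools sort "${describer}.uns.bam" "$describer"

# Index the BAM file
samtools index "${describer}.bam"

# Remove intermediate files
rm "${describer}.uns.bam"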


EDIT: Here is how you can achieve the same result without creating the intermediate file, by piping one samtools command into the next (thanks to the commenters):

  

 ls *.sam | parallel "samtools view -b -S {} | samtools sort - {.}; samtools index {.}.bam"
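
To see what that one-liner expands to, here is a rough sketch in plain bash of the two replacement strings it relies on: {} stands for the input line and {.} for the input with its last extension removed. The helper render_cmd and the sample file names are hypothetical and only mimic the substitution for simple file names; this is not how GNU parallel is implemented.

```shell
#!/usr/bin/env bash
# render_cmd is a hypothetical helper, not part of GNU parallel: it fills in
# the two replacement strings used in the one-liner for a single input file.
render_cmd() {
  local input="$1"          # what parallel substitutes for {}
  local stem="${input%.*}"  # what parallel substitutes for {.} (last extension removed)
  printf 'samtools view -b -S %s | samtools sort - %s; samtools index %s.bam\n' \
    "$input" "$stem" "$stem"
}

# Print the command that would be generated for each (hypothetical) input name.
for f in sample1.sam sample2.sam; do
  render_cmd "$f"
done
```

With GNU parallel itself, adding --dry-run to the one-liner prints the generated commands instead of running them, which is a cheap way to check the output names before touching any data.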
 

7 comments:

  1. Would this work?

    ls *.sam | parallel "samtools view -b {} | samtools sort - {.}.bam; samtools index {.}.bam"

    If so then you avoid the temporary file and thus you probably save disk I/O.

  2. Possibly! Either that or something along those lines might work. If you try it, please let us know whether it works.

  3. I would also try piping the samtools commands. The conversion is very I/O intensive, and on your average workstation four samtools processes will absolutely saturate the filesystem if the results are written back to disk before performing the sort and index. Will check it out and leave a note when done. But thanks for the heads-up on GNU parallel!

  4. Agreed, let us know when you try the piping.

  5. ls *.sam | parallel "samtools view -b -S {} | samtools sort - {.}; samtools index {.}.bam"

    The piping works!

  6. This is a great way to do those mundane sam/bam transformations, but can this piping method be used by aligners such as BWA?
