Sunday, November 21, 2010

Shortread assemblers: trans-ABySS, cufflinks and a bit of CLC assembler

I recently spent some time with short-read assembly programs. I like the trans-ABySS idea of merging de novo assemblies built with different k-mer lengths, and it seems to work better than relying on a single k-mer length. The pipeline is roughly as follows:

1) produce many assemblies with ABySS with different k-mer lengths (do it for all k-mer lengths between read_length/2 and read_length)
2) merge those assemblies with trans-ABySS
3) filter the merged contigs, map them to the genome, and filter again.
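
Step 1 is the part that is easiest to script. Below is a minimal sketch, not the trans-ABySS authors' own tooling: it lists every k between read_length/2 and read_length (I am assuming both ends are included and that read_length/2 is rounded up) and builds one abyss-pe command line per k. The read file names and the "asm" name prefix are made up for illustration, and the script only prints the commands rather than running them.

```python
import shlex


def kmer_range(read_length):
    """All k values from read_length/2 (rounded up) to read_length, inclusive."""
    start = (read_length + 1) // 2
    return list(range(start, read_length + 1))


def abyss_commands(read_length, reads, prefix="asm"):
    """Build one abyss-pe command (as an argument list) per k-mer length.

    `reads` is a list of read file paths; `prefix` is a hypothetical
    name prefix so that each assembly gets its own output names.
    """
    commands = []
    for k in kmer_range(read_length):
        in_arg = "in=" + " ".join(reads)  # abyss-pe takes all read files in one in='...' value
        commands.append(["abyss-pe", f"k={k}", f"name={prefix}-k{k}", in_arg])
    return commands


if __name__ == "__main__":
    # Hypothetical 36 bp paired-end reads; prints the commands without executing them.
    for cmd in abyss_commands(36, ["reads_1.fq", "reads_2.fq"]):
        print(shlex.join(cmd))
```

Printing first and running later makes it easy to hand the list to a cluster scheduler, or to run the commands one by one on a single machine.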

In principle, you can use parts of the trans-ABySS pipeline to merge assemblies from other de novo assemblers, such as Oases and the CLC assembler. The details on trans-ABySS can be found in their hidden manual, but ABySS itself doesn't have a very good manual. There is a mailing list where you can ask questions, but I would still prefer a proper manual. The other problem with ABySS and trans-ABySS is that they have a lot of dependencies, so you need patience when installing them.

Furthermore, producing and merging that many assemblies takes a lot of time, especially if you don't have easy access to a computer cluster. Having said that, tophat + cufflinks are not extremely fast either: mapping reads to the genome with tophat and creating transcripts with cufflinks takes 1-2 days on 6 processors of a Mac Pro for ~80M paired-end and ~80M single-end reads. I like tophat + cufflinks because they have very good manuals and pre-compiled binaries.
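
For reference, the two-step tophat + cufflinks run has roughly the following shape. This is a sketch under assumptions, not my exact invocation: the index prefix, read file names and output directories are placeholders, it covers only the paired-end reads, and the name of the alignment file that tophat writes (accepted_hits.bam here) depends on the tophat version, since older releases wrote SAM instead. The -p (threads) and -o (output directory) options are the ones both manuals describe.

```python
import shlex


def tophat_cufflinks_commands(index_prefix, reads_1, reads_2,
                              threads=6, tophat_out="tophat_out",
                              cufflinks_out="cufflinks_out"):
    """Build the tophat and cufflinks command lines as argument lists.

    All paths are placeholders; nothing is executed here.
    """
    tophat = ["tophat", "-p", str(threads), "-o", tophat_out,
              index_prefix, reads_1, reads_2]
    # Assumption: tophat wrote its alignments to accepted_hits.bam in tophat_out.
    cufflinks = ["cufflinks", "-p", str(threads), "-o", cufflinks_out,
                 f"{tophat_out}/accepted_hits.bam"]
    return [tophat, cufflinks]


if __name__ == "__main__":
    for cmd in tophat_cufflinks_commands("genome_index", "reads_1.fq", "reads_2.fq"):
        print(shlex.join(cmd))
```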

I also tried the trial version of the CLC assembler, which is a commercial product. First, I didn't like that it wasn't free. Second, I didn't like that you can't choose a k-mer size longer than 31 base pairs. But compared to ABySS it was very fast: it took less than an hour to complete 99% of the assembly, but then it crashed with a "segmentation fault". It is promising, but it is expensive and it didn't quite work for me. It would be better if at least the command line version were free. People who are comfortable with the command line and don't have access to a cluster can probably rent time on Amazon Cloud and assemble their sequences using ABySS, which should cost them much less. I heard CLC is around $50k (I don't know how reliable that information is, though).

There are other assemblers out there as well, but I haven't had time to try them yet:
Oases + Velvet: I heard both good things ("worked well for me") and bad things ("heavy on memory" and "was very slow") about it.
Scripture: I initially heard good things about it, but it has worse accuracy according to the trans-ABySS paper.
