Seven Major Trend Changes of 2013 - (ii) Bioinformatics
2. In NGS Bioinformatics, SPAdes Assembler, BCR Algorithm and Diginorm Approach Gained Prominence
a) SPAdes and Scaffolding Problem
Velvet was all the rage in 2012, when we first reported on SPAdes from Russia’s Algorithmic Biology lab (check here, here and here). One year later, SPAdes gained recognition by performing well in the GAGE-B evaluation.
In this respect, the biggest change in perception was the realization that scaffolding is a non-trivial step responsible for many assembly errors. The authors of the SPAdes assembler spent considerable intellectual firepower on solving the scaffolding problem.
We wrote several commentaries on other excellent assemblers or on assembly-related issues. A small subset is here.
SPAdes and MaSuRCA Assemblers Performed Best in GAGE-B Evaluation
Our First Look at SOAPdenovo2 Source Code
On MaSuRCA Paper and Algorithm
Cleverness of the Ray Assembler
Very Helpful Preprocessing Module for Those Interested in Assembling Genomes
Rayan Chikhi’s KmerGenie Slides from HitSeq 2013
GAM-NGS and REAPR Papers are Published
b) Alternate Approaches for Processing Large Libraries - Bauer-Cox-Rosone Algorithm/BEETL/Ropebwt and Kmer Counting/Diginorm/Sailfish
Heng Li brought to our attention a series of excellent papers written by Anthony Cox, Giovanna Rosone and co-authors in 2012. In “Heng Li Releases Ropebwt2”, we wrote -
We discussed the BCR algorithm and related topics on BWT construction from short reads many times (here - check comment section, here, here and here). Readers may find the following implementation useful (h/t: @rayanchikhi). Version 1 of this code is an important component of the SGA assembler.
The goal of their work was to improve the processing of large NGS files by converting them into an FM-index. When we checked “the problems being solved by the top NGS bioinformaticians today?” in Nov 2013, we found that most had a project involving the implementation of a similar algorithm. That definitely qualifies as a change in perception, and it is driven by the large sizes of HiSeq libraries and the time it takes to compress/process/ftp them.
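To illustrate the idea behind this line of work, here is a minimal sketch of the Burrows-Wheeler transform, the core of an FM-index. This is emphatically not the BCR algorithm - BCR builds the BWT of millions of reads incrementally with external memory, while the naive method below sorts all rotations in RAM and only works for tiny inputs - but it shows why the transform helps with compression.

```python
def bwt(text):
    """Burrows-Wheeler transform via naive rotation sorting.

    A terminal sentinel '$' (lexicographically smallest character)
    is appended, as in most BWT implementations.
    """
    s = text + "$"
    # Sort all cyclic rotations of the string, then take the last
    # column of the sorted rotation matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

# Repetitive sequence produces long character runs in the BWT,
# which is what makes BWT-transformed read libraries so compressible.
print(bwt("ACGTACGT"))  # -> TT$AACCGG
```

An FM-index adds rank/select structures on top of this transformed string, which is what lets tools search and count reads without decompressing the library.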
Titus Brown and other researchers followed a different approach to solve the same problem of large libraries. They converted each read into k-mer units and used the k-mer collection as a substitute for the reads to expedite processing. Digital normalization used k-mers to prune the read collection, Sailfish used them to rapidly compute expression from RNA-seq data without alignment, and kSNP used them to find SNPs without alignment. For k-mer counting, programs like Jellyfish are popular, but readers may take a look at the following low-memory methods from Rayan Chikhi and Titus Brown.
DSK: K-mer Counting with Very Low Memory Usage
Efficient Online k-mer Counting using a Probabilistic Data Structure
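To make the digital normalization idea concrete, here is a toy sketch. It is not the khmer implementation - khmer keeps its counts in a probabilistic data structure, while this sketch uses an exact dictionary - but the logic is the same: a read is kept only if its median k-mer abundance, measured against the reads accepted so far, is still below a coverage cutoff.

```python
from collections import defaultdict
from statistics import median

def diginorm(reads, k=4, cutoff=2):
    """Toy digital normalization: discard reads whose k-mers are
    already well covered by previously accepted reads."""
    counts = defaultdict(int)  # exact k-mer counts; khmer uses a sketch
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Median abundance is robust to a few erroneous k-mers.
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)
            for km in kmers:
                counts[km] += 1
    return kept

# Redundant copies past the coverage cutoff are discarded,
# while novel sequence is retained:
reads = ["ACGTACGTAC"] * 5 + ["TTTTGGGGCC"]
print(diginorm(reads))  # -> ['ACGTACGTAC', 'TTTTGGGGCC']
```

The k and cutoff values above are illustrative; real data would use a larger k (20 or more) and a cutoff chosen from the expected coverage.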
What has not been generally recognized is that k-mer counting and building the BWT are closely related approaches. In fact, one can build the Burrows-Wheeler transform of a large library by starting from its k-mer counts.
Speaking of blogs on algorithms, here is an incomplete list of several excellent ones -
We will soon add a few others to the above list.
In the following commentary, we will cover -
Seven Major Trend Changes of 2013 (iii) Genomics