Archives

Categories

Y
Y

String Graph Assembler

Homolog.us blog is written by professional janitors dedicated to clean up US science. During lunch breaks and other time off from the job, we discuss bioinformatics. The name 'homolog.us' is not a spelling mistake, but is derived by taking Arabic translation of the 'O' in the original word.

Please follow us on twitter – @homolog_us.

-------------------------------------------------------------------------------------------------------------

A recent Genome Research paper describing an innovative approach for assembling large genomes from NGS data caught our attention for several reasons. The paper is coauthored by Jared Simpson, the developer of ABySS assembler and Richard Durbin, who runs one of the strongest research groups in bioinformatics. Here we will briefly explain what we like in their approach, and then go over the details of the algorithm in our following commentaries.

The most appealing aspect of the Simpson-Durbin algorithm is that it does not rely on de Bruijn graphs, and instead employs a different graph construction approach called ‘string graph’. String graphs were first proposed by E. W. Myers in a 2005 publication.

What is wrong with de Bruijn graph method?

i) Memory requirement – As you are well aware, construction of de Bruijn graphs by partitioning all reads into k-mers is memory-intensive. For mammalian-sized genomes, de Bruijn graph construction from short reads can take over 1000 GB or more RAM, which is prohibitively expensive.

ii) Loss of information – When we chop reads into k-mers, their connectivity information is lost. Longer the reads, more information is lost.

Instead of constructing de Bruijn graph, Simpson-Durbin assembler computes overlaps between all read pairs, and then constructs a string graph based on the overlaps. Finally, they derive the genome assembly from the string graph. The memory required by string graph method is considerably lower than de Bruijn graphs. Thus, this new approach can potentially handle mammalian-sized genomes at a significantly lower cost. Quoting their latest paper on comparison of SGA, ABySS, Velvet and SOAPdenovo,

Of the four assemblers, SGA used the least memory (4.5 GB vs. 14.1 GB, 23.0 GB and 38.8 GB for ABySS, Velvet and SOAPdenovo, respectively). The de Bruijn graph assemblers were considerably more computationally efficient, however, as the SGA assembly required approximately eight times more CPU hours than ABySS, 20 times more than Velvet, and three times more than SOAPdenovo. This speed difference is largely due to the time required to build the FM-index. However, we can reuse one FM-index for multiple runs of SGA, for instance to try different error correction or assembly parameters, whereas the de Bruijn table for ABySS, Velvet, and SOAPdenovo must be recalculated for each choice of k.

Why did nobody else try similar approach before?

It is because the computation of overlaps between all read pairs is very time-consuming. The most important innovation of Simpson-Durbin approach comes into reducing this computational time from quadratic order to linear order of the number of reads (or rather their sequence length). They achieved it by representing the reads as suffix array, and then finding the overlaps using Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform. The core of Simpson-Durbin algorithm, where they presented the above step, is available from an earlier paper. In the latest paper, they described use of their technique on mammalian-sized sequence data.

If you want to learn about String Graph assembler, please read the following papers -

i) The Fragment Assembly String Graph – E. W. Myers

This paper describes the String Graph concept.

ii) Efficient construction of an assembly string graph using the FM-index – Jared T. Simpson and Richard Durbin

This earlier paper from Simpson and Durbin

iii) Efficient de novo assembly of large genomes using compressed data structures – Jared T. Simpson and Richard Durbin




-------------------------------------------------------------------------------------------------------------
Heroes and Heroines of New Media--2014

I am strongly influenced by Charles Hugh Smith, who runs his insightful social blog of Two Minds. I hope he will not mind, if I copy his style of acknowledgement to the supporters of our blog.

Our blog is deeply honored by the generous contribution of the following readers. Without their patronage, this site would go away.

Outstandingly Generous:  
Amemiya C. Schnable J. Bowman B.

We are also looking for subscribers to get help to finish the tutorials. Please see this post for details.

3 comments to String Graph Assembler

Leave a Reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>