Three Helpful Guides for Those Working on Genome Assembly



A. Rayan Chikhi’s slides – comprehensive yet introductory

Conclusions -

What is a good assembly ?
* No total order
* Main metrics : N50, coverage, accuracy
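Of the metrics above, N50 is the one most often quoted and most often misread, so a minimal sketch may help. It uses the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name `n50` and the contig lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            # This contig pushes the cumulative length past 50% of the total.
            return length


# Hypothetical contig lengths (total 290, so half is 145).
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # 70
```

N50 only measures contiguity. An assembly can score a higher N50 by joining contigs wrongly, which is one reason there is no total order between assemblies and accuracy and coverage have to be reported alongside it.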

How are assemblies made ?
* Typically, using a de Bruijn graph or a string graph.
* Errors and small variants are removed from the graph.
* Contigs are just simple paths from the graph.
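The idea that contigs are simple paths can be sketched on a toy de Bruijn graph. This is an illustration only, not how any of the assemblers named here are implemented: it ignores reverse complements, does no error or variant removal, and skips isolated cycles. The function names `de_bruijn` and `simple_paths` and the reads are invented for the example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of its successor nodes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def simple_paths(graph):
    """Spell out maximal non-branching paths as contig strings."""
    indeg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
    nodes = set(graph) | set(indeg)

    def is_inner(n):
        # Exactly one way in and one way out: the path passes straight through.
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_inner(node):
            continue  # paths start only at tips and branch points
        for succ in sorted(graph.get(node, ())):
            contig = node + succ[-1]
            while is_inner(succ):
                succ = next(iter(graph[succ]))
                contig += succ[-1]
            contigs.append(contig)
    return contigs


# Two overlapping toy reads from one unbranched sequence.
reads = ["ATGGCGT", "GCGTGCA"]
print(simple_paths(de_bruijn(reads, k=4)))  # ['ATGGCGTGCA']
```

When two reads disagree at one position, the graph branches and the walk stops at the branch point, so the output breaks into shorter contigs. That is why assemblers first remove errors and small variants from the graph.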

Assembly software
* Recommended software for Illumina data : SOAPdenovo2, Allpaths-LG
* Plethora of other software for custom needs : Minia for low-memory, SGA for very accurate assembly, etc.
* Recommended software for 454 data : Newbler, Celera

A few tips
* How to choose k : always try many values
* Put the assembler inside a pipeline : error correction, scaffolding, gap-filling
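To see why the advice on k is to try many values, here is a small sketch that counts branching nodes in a toy de Bruijn graph for several k. The function `branching_nodes` and the read are hypothetical, and counting branches is only a stand-in for the contiguity statistics you would compare on real assemblies.

```python
from collections import defaultdict


def branching_nodes(reads, k):
    """Count (k-1)-mer nodes with more than one successor or predecessor."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            left, right = read[i:i + k - 1], read[i + 1:i + k]
            succ[left].add(right)
            pred[right].add(left)
    nodes = set(succ) | set(pred)
    return sum(1 for n in nodes if len(succ[n]) > 1 or len(pred[n]) > 1)


# Toy read in which the 3-mer "ACG" occurs twice (a tiny repeat).
reads = ["TACGTGACGA"]
for k in (3, 4, 5):
    print(k, branching_nodes(reads, k))
# 3 3
# 4 1
# 5 0
```

Larger k untangles the repeat here, but on real data a larger k also needs reads to overlap over longer stretches, so it demands more coverage and is more sensitive to sequencing errors. The best value depends on the dataset, which is why several are tried.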

Case study
* How to assemble a human genome with Minia


C. A Good Thread in SeqAnswers Forum

Original question -

I have some new Illumina data (HiSeq 100 bp reads: one paired-end library with 94 × 10^6 reads and one mate-pair library with 54 × 10^6 reads) for a fungal genome (ca. 30 Mb) for which a pretty good reference is already assembled/available.

My coverage is about 400X, and I have de novo assembled the new data with both Velvet (VelvetOptimiser) and Soapdenovo, but based on simple metrics, e.g. # scaffolds, largest scaffold, N50, this new assembly doesn’t appear to be quite as good as the reference.

I don’t have access to the read data used to assemble the original reference, and I would like to see if I can improve it with this additional data. It looks like you can give Velvet a -long switch for a reference seq, but the documentation isn’t very clear on this. And I’m not sure how to go about generating a “new” reference sequence/scaffolds after, for example, using an aligner such as Bowtie or BWA to align the new read data to the reference seq.

Can someone suggest/describe the best approach or a pipeline to get where I want to go with this dataset?
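As a sanity check on the coverage quoted in the question, the expected depth can be estimated as (number of reads × read length) / genome size. The sketch below assumes the read counts are 94 million and 54 million individual 100 bp reads (they could instead be pairs) and takes the genome size as 30 Mb. The function name `fold_coverage` is invented for the example.

```python
def fold_coverage(read_counts, read_length, genome_size):
    """Expected depth C = N * L / G, summed over all libraries."""
    total_bases = sum(read_counts) * read_length
    return total_bases / genome_size


# Assumed inputs taken from the question; treat them as approximate.
paired_end = 94_000_000
mate_pair = 54_000_000
depth = fold_coverage([paired_end, mate_pair], read_length=100, genome_size=30_000_000)
print(f"{depth:.0f}X")  # 493X
```

The raw figure lands a little above the "about 400X" in the question, which is the same order of magnitude. The exact inputs behind the poster's number are not stated.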

