A. Rayan Chikhi’s slides – comprehensive yet introductory
Conclusions -
What is a good assembly ?
* No total order
* Main metrics : N50, coverage, accuracy
* Use QUASTHow are assemblies made ?
* Typically, using a de Bruijn graph or a string graph.
* Errors and small variants are removed from the graph.
* Contigs are just simple paths from the graph.Assembly software
* Recommended software for Illumina data : SOAPdenovo2, Allpaths-LG
* Plethora of other software for custom needs : Minia for low-memory, SGA for
very accurate assembly, etc..
* Recommended software for 454 data : Newbler, CeleraA few tips
* How to choose k : always try many values
* Put the assembler inside a pipeline : error correction, scaffolding, gap-fillingCase study
* How to assemble a human genome with Minia
B.
C. A Good Thread in SeqAnswers Forum
Original question -
I have some new Illumina data (HiSeq 100b reads- one paired-end (94xe6 reads) and one mate-pair (54xe6 reads) lib.) for a fungal genome (ca. 30MB) for which a pretty good reference is already assembled/available.
My coverage is about 400X, and I have de novo assembled the new data with both Velvet (VelvetOptimiser) and Soapdenovo, but based on simple metrics, e.g. # scaffolds, largest scaffold, N50, this new assembly doesn’t appear to be quite as good as the reference.
I don’t have access to the read data used to assemble the original reference, and I would like see if I can improve it with this additional data. It looks like you can give Velvet a -long switch for a reference seq, but the documentation isn’t very clear on this. And, I’m not sure how to go about generating a “new” reference sequence/scaffolds after, for example, using an aligner, e.g. Bowtie or BWA, to align the new read data to the reference seq.
Can someone suggest/describe the best approach or a pipeline to get where I want to go with this dataset?












use Mummer to align contigs from assembly to reference
fill in gaps in assembly with sequence from reference
align reads to new chimeric assembly, visualize in IGV
correct errors observed from alignment, particularly in regions taken from reference
repeat steps 3 + 4 until assembly is correct (good coverage, zero SNPs/INDELs across assembly)
painful, but yields reference quality assembly in a few days
Thanks Kevin.
[...] [...]
[...] [...]
[...] [...]
[...] Three Helpful Guides for Those Working on Genome Assembly [...]