Building an NGS Reference List (de novo assembly category)

Did we miss any important category/paper?

Not all t’s are crossed and i’s are dotted yet. We will also add hyperlinks soon.

We created this list for our own convenience. However, it took us some time to get all the pieces together, and we thought posting the full list here would help someone else go to sleep early. Wherever we could, we added short narration about the papers or the section (again for our own convenience). There is no guarantee that the texts describe the papers accurately. Neither is there guarantee that the texts describe the papers inaccurately.

The papers listed here deal with bioinformatics problems that arise when no reference genome exists. Papers on alignment, SNP calling, etc. are kept in a separate list.

1. Pre-NGS Genome Assemblers

This section includes old assembly-related papers that we may need to cite from time to time. The first subgroup has links to base-calling and error correction programs such as phred, phrap and consed. The second subgroup has more sophisticated assemblers, but they mostly belong to the overlap-layout-consensus type. Those dinosaurs used to rule the world in the not too distant past.

CAP3 and Celera are the only programs that we like. They are our pet dinosaurs. That does not mean the others are bad. We heard very good opinions about them from, well, the genome centers that nurture them. Most programs are associated with one genome center or another. TIGR Assembler is from TIGR. ARACHNE belongs to Broad. Phusion is from the Sanger Institute. Atlas came from Baylor. Celera was written by J. C. Venter’s company, but is maintained by the bioinformatics group from Maryland.

Base-calling and error detection

Krawetz SA (1989) Sequence errors described in GenBank: a means to determine the accuracy of DNA sequence interpretation. Nucleic Acids Res. 17(10):3951-7. Link

Bonfield JK, Staden R (1995) The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23(8):1406-10. Link

Ewing B, Hillier L, Wendl M, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8:175-185 (1998). Link

Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8:186-194 (1998). Link

Gordon D, Abajian C, Green P (1998) Consed: a graphical tool for sequence finishing. Genome Research 8:195-202. Link


Sutton, G. G., White, O., Adams, M. D., Kerlavage, A. R. (1995) TIGR Assembler: A new tool for assembling large shotgun sequencing projects.Genome Science and Technology. 1(1): 9-19. Link

Huang X, Madan A, (1999) CAP3: A DNA sequence assembly program.Genome Res. 9(9):868-77. Link

Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, et al. (2000) A whole-genome assembly of Drosophila. **Science 287(5461):**2196-2204. Link

Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES (2002) ARACHNE: A whole-genome shotgun assembler. **Genome Res 12(1):**177-189. Link

Mullikin JC, Ning ZM (2003) The phusion assembler. **Genome Res 13(1):**81-90. Link

Istrail, S. et al. (2004) Whole-Genome Shotgun Assembly and Comparison of Human Genome Assemblies. Proc. Nat. Acad. Sci. USA 101:1916-1921. Link

David B. Jaffe, Jonathan Butler, Sante Gnerre, Evan Mauceli, Kerstin Lindblad-Toh, Jill P. Mesirov, Michael C. Zody, and Eric S. Lander (2003) **Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2.** Genome Res. 13(1). Link

Havlak P, Chen R, Durbin KJ, Egan A, Ren YR, Song XZ, Weinstock GM, Gibbs RA (2004) The atlas genome assembly system. **Genome Res 14(4):**721-732. Link

Chapman JA, Ho I, Sunkara S, Luo S, Schroth GP, et al. (2011) **Meraculous: De Novo Genome Assembly with Short Paired-End Reads.** PLoS ONE 6(8): e23501. Link

Mihai Pop, Adam Phillippy, Arthur L. Delcher, Steven L. Salzberg (2004) Comparative Genome Assembly, Briefings in Bioinformatics 5 (3):237-248. Link

R. Xia and A. Kim (2012) MERmaid: A Parallel Genome Assembler for the Cloud. Link

Weber JL, Myers EW (1997) Human whole-genome shotgun sequencing. Genome Res 7: 401-409. Link

2. NGS Genome Assemblers

non de Bruijn, k-mer based

You need a video to understand this category (pay attention to the guys who jumped early).

Sundquist A, Ronaghi M, Tang HX, Pevzner P, Batzoglou S (2007) Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies. **PLoS ONE,** 2(5). Link

Warren RL, Sutton GG, Jones SJM, Holt RA: Assembling millions of short DNA sequences using SSAKE. **Bioinformatics 2007,** 23(4):500-501. Link

SHORTY (Chen and Skiena, 2007) [specialised in localising the use of paired-end reads.]

Daniel D Sommer, Arthur L Delcher, Steven L Salzberg and Mihai Pop (2007) **Minimus: a fast, lightweight genome assembler.** BMC Bioinformatics 2007, 8:64. Link

de Bruijn graph-based assemblers

Many articles in our blog explained this class of assemblers.
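For readers who have not seen those articles, here is a minimal sketch of the core idea, under heavy simplifying assumptions: error-free reads, a single strand, and no coverage filtering. The function name `build_de_bruijn` and the toy reads are ours, made up for illustration; they do not come from any of the assemblers cited in this section.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph from reads.

    Nodes are (k-1)-mers. Every k-mer found in a read adds one edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix. Repeated edges
    are kept, so the list length reflects how often a k-mer was seen.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (hypothetical data).
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)

for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Real assemblers do much more after this step (error pruning, bubble removal, handling reverse complements, walking paths to produce contigs), so treat this only as a picture of how reads turn into a graph.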

Idury RM, Waterman MS (1995) A new algorithm for DNA sequence assembly. J Comput Biol 2: 291-306. Link

Pevzner, Pavel A.; Tang, Haixu (2001). Fragment Assembly with Double-Barreled Data. Bioinformatics/ISMB 1: 19. Link

Chaisson M, Pevzner P, Tang H (2004) Fragment assembly with short reads. Bioinformatics 20: 2067-74. Link

Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. **Proc. Nat. Acad. Sci. USA 2001,** 98(17):9748-9753. Link

Zerbino DR, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. **Genome Research 2008,** 18(5):821-829. Link

Zerbino D. Genome assembly and comparison using de Bruijn graphs. Ph.D. Thesis, EBI, UK. Link

Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: De novo assembly of whole-genome shotgun microreads. **Genome Research 2008,** 18(5):810-820. Link

Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: A parallel assembler for short read sequence data. **Genome Research 2009,** 19(6):1117-1123. Link

Boisvert S, Laviolette F, Corbeil J (2010) Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol 17(11):1519-33. Link


Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J (2010) De novo assembly of human genomes with massively parallel short read sequencing. **Genome Research 20(2):**265-272. Link

Li R, Fan W, Tian G, et al. (2010) The sequence and de novo assembly of the giant panda genome. **Nature 463(7279):**311-317. Link

Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, et al. (2011) High-quality draft assemblies of mammalian genomes from massively parallel sequence data. **P Natl Acad Sci USA 108(4):**1513-1518. Link

Chitsaz H, Yee-Greenbaum J, Tesler G, Lombardo M, Dupont C, et al. (2011) Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol 29: 915-21. Link

Rodrigue S, Malmstrom R, Berlin A, Birren B, Henn M, et al. (2009) Whole genome amplification and de novo assembly of single bacterial cells. PLoS One 4: e6864. Link

Schuster SC, Miller W, Ratan A, Tomsho LP, Giardine B, et al. (2010) Complete Khoisan and Bantu genomes from southern Africa. Nature 463. Link


Deng HW, Lin Y, Li J, Shen H, Zhang L, Papasian CJ (2011) Comparative studies of de novo assembly tools for next-generation sequencing technologies. **Bioinformatics 27(15):**2031-2037. Link

Salzberg SL, Phillippy AM, Zimin AV, Puiu D, Magoc T, Koren S, Treangen T, Schatz MC, Delcher AL, Roberts M, et al. (2011) **GAGE: A critical evaluation of genome assemblies and assembly algorithms.** Genome Res. Link

Zhang WY, Chen JJ, Yang Y, Tang YF, Shang J, Shen BR (2011) A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. **PLoS ONE,** 6(3). Link

Magoč T, Salzberg SL (2011) **FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies.** Bioinformatics. Link

Salzberg SL, Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C (2004) Versatile and open software for comparing large genomes. **Genome Biol,** 5(2). Link

D. Earl et al. (2011) **Assemblathon 1: A competitive assessment of de novo short read assembly methods,** Genome Research, 21:2224-2241. Link

3. Exomes, Transcriptomes, Metagenomes and Highly Polymorphic Genomes

Transcriptome Assemblers

Robertson D, Schein J, Chiu R, Corbett R, Field M et al. (2010) **De novo assembly and analysis of RNA-seq data.** Nature Methods 7, 909-912. Link

Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA et al. (2011) **Full-length transcriptome assembly from RNA-seq data without a reference genome.** Nat Biotechnol. 29(7):644-52. Link

Schulz M, Zerbino D, Vingron M, Birney E (2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28: 1086-92. Link


Namiki T, Hachiya T, Tanaka H, Sakakibara Y (2011) MetaVelvet: An extension of Velvet assembler to de novo metagenome assembly from short sequence reads. ACM Conference on Bioinformatics, Computational Biology and Biomedicine. Link

Peng Y, Leung H, Yiu S, Chin F (2011) **Meta-IDBA: a de novo assembler for metagenomic data.** Bioinformatics 27: i94-101. Link

Vaughn Iverson, Robert M. Morris, Christian D. Frazar, Chris T. Berthiaume, Rhonda L. Morales, E. Virginia Armbrust (2012) Untangling Genomes from Metagenomes: Revealing an Uncultured Class of Marine Euryarchaeota, Science 335(6068):587-590. Link

C. T. Brown on scaling metagenome assembly

Polymorphic Genomes

Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet 44: 226-32. Link

S. Huang et al. (2012) **HaploMerger: Reconstructing allelic relationships for polymorphic diploid genome assemblies,** Genome Research. Link

Targeted Assembly

P. Peterlongo and R. Chikhi (2012) **Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer.** BMC Bioinformatics 13:48. Link

René L. Warren, Robert A. Holt (2011) **Targeted Assembly of Short Sequence Reads.** PLoS ONE. Link

4. Faster, better, cheaper

k-mer counting

Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. **Commun ACM,** 13:422-426. Link

Marçais G, Kingsford C (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. **Bioinformatics 2011,** 27(6):764-770. Link

Melsted P, Pritchard JK (2011) Efficient counting of k-mers in DNA sequences using a bloom filter. **BMC Bioinformatics 12.** Link

C. T. Brown, **khmer**


Christley S, Lu Y, Li C, Xie X (2009) Human genomes as email attachments. Bioinformatics 25:274-275.

Conway TC, Bromage AJ (2011) **Succinct data structures for assembling large genomes.** Bioinformatics 27(4):479-486. Link

Fritz MH-Y, Leinonen R, Cochrane G, Birney E (2011) **Efficient storage of high throughput DNA sequencing data using reference-based compression.** Genome Res. 21:734-740.

Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA (2004) Reducing storage requirements for biological sequence comparison. **Bioinformatics 20(18):**3363-3369. Link

Pinho A, Pratas D, Garcia S (2012) GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res 40: e27. Link

Daniel C. Jones, W. L. Ruzzo, X. Peng, M. G. Katze (2012) **Compression of next-generation sequencing reads aided by highly efficient de novo assembly,** Nucleic Acids Research. Link

Rayan Chikhi and Guillaume Rizk (2012) **Space-efficient and exact de Bruijn graph representation based on a Bloom Filter.** Link

Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G (2011) Compressing genomic sequence fragments using SlimGene. J Comput Biol. (3):401-13. Link

Pell J. et al. (2012) **Scaling metagenome sequence assembly with probabilistic de Bruijn graphs,** Proc. Nat. Acad. Sci. USA. Link

C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data.

Error correction

Kelley D, Schatz M, Salzberg S (2010) Quake: quality-aware detection and correction of sequencing errors. **Genome Biology 11(11):**R116.

Medvedev P, Scott E, Kakaradov B, Pevzner P (2011) Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics 27: i137-41.

C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data.


Schatz M (2009) Cloudburst: highly sensitive read mapping with mapreduce. Bioinformatics 25: 1363-9.

Contrail < bio/index.php?title=Contrail>


Shi H, Schmidt B, Liu W, Müller-Wittig W (2010) A Parallel Algorithm for Error Correction in High-Throughput Short-Read Data on CUDA-Enabled Graphics Hardware. **Journal of Computational Biology 17(4):**603-615.

Liu Y, Schmidt B, Maskell DL (2011) Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics, 12:354. Link

String graph assembler

Myers EW (2005) The fragment assembly string graph. **Bioinformatics 21:**79-85. Link

Simpson JT, Durbin R: **Efficient de novo assembly of large genomes using compressed data structures.** Genome Res 2011. Link

Simpson J, Durbin R (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26: i367-73. Link

Simpson J, Durbin R (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome Res 22: 549-56. Link


Zerbino DR, McEwen GK, Margulies EH, Birney E (2009) Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4: e8407. Link

Koren S, Treangen T, Pop M (2011) Bambus 2: scaffolding metagenomes. Bioinformatics 27: 2964-71. Link

Alexey A. Gritsenko et al. (2012) GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. Bioinformatics 28(11): 1429-1437. Abstract


Zerbino DR, McEwen GK, Margulies EH, Birney E (2009) Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4: e8407. Link

Do H, Choi K, Preparata F, Sung W, Zhang L (2008) Spectrum-based de novo repeat detection in genomic sequences. J Comput Biol 15: 469-87. Link

Novak P, Neumann P, Macas J (2010) Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11: 378. Link

Gu W, Castoe T, Hedges D, Batzer M, Pollock D (2008) Identification of repeat structure in large genomes using repeat probability clouds. **Anal Biochem 380: 77-83.** Link

6. Reviews, visions and IMs

Metzker M (2010) Sequencing technologies - the next generation. Nat Rev Genet 11: 31-46. Link

Shendure J. and Ji H. (2008) Next-generation DNA sequencing, Nature Biotechnol. 26, 1135-1145. Link

Phillippy AM, Schatz MC, Pop M (2008) **Genome assembly forensics: finding the elusive mis-assembly.** Genome Biol 9(3). Link

Miller J, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data. Genomics 95: 315-27. Link

Trapnell C, Salzberg S (2009) How to map billions of short reads onto genomes. Nat Biotechnol 27: 455-7. Link

Stein L. (2010) The case for cloud computing in genome informatics. Genome Biol 11: 207. Link

Li Y, Hu Y, Bolund L, Wang J (2010) State of the art de novo assembly of human genomes from massively parallel sequencing data. Hum Genomics 4: 271-7. Link

Compeau P, Pevzner P, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol. 29: 987-91. Link

Nagarajan N, Pop M (2009) Parametric complexity of sequence assembly: theory and applications to next generation sequencing. _J Comput Biol_ 16: 897-908. Link

Carl Kingsford, Michael C Schatz and Mihai Pop (2010) Assembly Complexity of prokaryotic genomes using short reads. **BMC Bioinformatics 2010,** 11:21. Link

Pop M, Salzberg SL (2008) Bioinformatics challenges of new sequencing technology. Trends Genet 24: 142-149. Link

Flicek P, Birney E (2009) Sense from sequence reads: methods for alignment and assembly. Nat Methods 6: S6-S12. Link (pdf), Link

Written by M.