Veli Mäkinen’s Book – Genome-Scale Algorithm Design


Veli Mäkinen from University of Helsinki (and Belazzougui, Cunial, Tomescu) wrote a book titled ‘Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing’. I have not read the book yet, but based on reading papers from his group, I expect it to be an insightful one.


Did any of our readers get a chance to go through it yet?


Speaking of books, readers may also enjoy the following three blog posts. We promise not to fill your shelf with anything other than the best ones.

Possibly the Best Book on Personal Genomics and Personalized Medicine

A Review of “Bioinformatics Algorithms – An Active Learning Approach” by Compeau and Pevzner

A Must Read Book, If You Like to Understand Genomes

QuorUM: An Error Corrector for Illumina Reads

This new paper comes from G. Marçais, Alex Zimin and Jim Yorke of minimizer fame.

Illumina Sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with low (advertised 1%) error rate, 100 × coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.

We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented making use of current multi-core computing architectures and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.

QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at

Readers may also take a look at Heng Li’s error correction program published preprinted early this year.

Steven Salzberg Publishes Reanalysis of UCSF’s Defective Lancet Paper in F1000

After Yoav Gilad’s reanalysis of mouse ENCODE data, respected computational biologist Steven Salzberg took the F1000 route to critically re-examine the claims of another high-profile paper.


Over the last few years, bio-medical journals have seen severe outbreak of high-profile papers of poor quality. Cleaning up the mess through the same exclusive journals is extremely difficult (check the experiences of Dan Graur or Lior Pachter), but thanks to alternate publishing venues, there is finally some hope.

Today’s paper under fire is about the recent enterovirus D68 outbreak that caused ‘mystery paralysis’ and polio-like symptoms. The authors claimed to have found (A novel outbreak enterovirus D68 strain associated with acute flaccid myelitis cases in the United States from 2012-2014: a retrospective cohort study) through metagenomic sequencing –


48 patients were included: 25 with acute flaccid myelitis, two with enterovirus-associated encephalitis, five with enterovirus-D68-associated upper respiratory illness, and 16 with aseptic meningitis or encephalitis who tested positive for enterovirus. Enterovirus D68 was detected in respiratory secretions from seven (64%) of 11 patients comprising two temporally and geographically linked acute flaccid myelitis clusters at the height of the 2014 outbreak, and from 12 (48%) of 25 patients with acute flaccid myelitis overall. Phylogenetic analysis revealed that all enterovirus D68 sequences associated with acute flaccid myelitis grouped into a clade B1 strain that emerged in 2010. Of six coding polymorphisms in the clade B1 enterovirus D68 polyprotein, five were present in neuropathogenic poliovirus or enterovirus D70, or both. One child with acute flaccid myelitis and a sibling with only upper respiratory illness were both infected by identical enterovirus D68 strains. Enterovirus D68 viraemia was identified in a child experiencing acute neurological progression of his paralytic illness. Deep metagenomic sequencing of cerebrospinal fluid from 14 patients with acute flaccid myelitis did not reveal evidence of an alternative infectious cause to enterovirus D68.

Salzberg found several defects in their analysis. The authors did not use any standard bioinformatics program and instead reinvented their version of square wheel, which they published as a separate bioinformatics paper. Unfortunately, those great programs failed to find out that two patients had other major bacterial infections (Haemophilus influenzae and severe Staphylococcus aureus). Moreover, the authors did not remove human DNA from patients’ data before public submission, as they claimed to have done likely to protect patients’ privacy. The full abstract of Salzberg is shown below –

Re-analysis of metagenomic sequences from acute flaccid myelitis patients reveals alternatives to enterovirus D68 infection

Metagenomic sequence data can be used to detect the presence of infectious viruses and bacteria, but normal microbial flora make this process challenging. We re-analyzed metagenomic RNA sequence data collected during a recent outbreak of acute flaccid myelitis (AFM), caused in some cases by infection with enterovirus D68. We found that among the patients whose symptoms were previously attributed to enterovirus D68, one patient had clear evidence of infection with Haemophilus influenzae, and a second patient had a severe Staphylococcus aureus infection caused by a methicillin-resistant strain. Neither of these bacteria were identified in the original study. These observations may have relevance in cases that present with flaccid paralysis because bacterial infections, co-infections or post-infection immune responses may trigger pathogenic processes that may present as poliomyelitis-like syndromes and may mimic AFM. A separate finding was that large numbers of human sequences were present in each of the publicly released samples, although the original study reported that human sequences had been removed before deposition.

Thankfully, the main claim of the original paper is still valid as Dr. Salzberg pointed out in a private email.

I should add that the main finding of Greninger et al. is not really in doubt – their primary analysis showed that the 2014 strain of enterovirus D68 formed a new clade. But the secondary analyses included some surprising flaws.



1. Omics!Omics! blog wrote –

Leaky clinical metagenomics pipelines are a very serious issue


2. Charles Chiu, the senior author of UCSF paper, posted a response on his website signed by three authors (incl. him). I am quoting it here, but please feel free to discuss at the F1000 site.

This manuscript raises two main criticisms of our paper in their re-analysis. Here we directly address the 2 main points:

1. The authors claim that bacterial reads were seen in the nasopharyngeal / oropharygneal swab (NP/OP) metagenomic data that were “missed” in the original study. They were able to assemble two genomes, from Haemophilus influenzae and methicillin-resistant Staphylococcus aureus. We would like to highlight that these reads and bacteria were not missed but instead we did not discuss the bacterial/fungal portion of the NP/OP swabs in the manuscript due to difficulties in clinical interpretation.
As quoted from the supplementary section of our manuscript published in Lancet Infectious Diseases, with our emphasis in underline:

Lancet ID Supplementary Material: NGS libraries constructed from NP/OP samples were treated with DNase following nucleic acid extraction to reduce background from the human host and bacterial flora. As this protocol reduces sensitivity of detection and speciation for non-viral microbes (i.e. bacteria, fungi, and parasites), only viral sequences are shown for the NP/OP samples. The ability to detect DNA viruses is also impacted by the use of DNAse and we cannot exclude the possibility that our data is biased by reduced sensitivity for detection of DNA viruses.

a) The authors also state that we “did not report finding any bacterial reads from this sample” but that does not mean that we did not detect any bacterial reads. In fact, we detected both the Haemophilus influenzae and methicillin-resistant Staphylococcus aureus reported by the authors:

The number of bacterial reads found in all metagenomic samples analyzed is shown in Supplementary Table 3, Column N. Bacterial reads for the two samples in question are listed as US/CA/09-871: 5,614,487 bacterial reads and US/CA/12-5837: 28,676,383 bacterial reads.

In the main text we also state that the purpose of metagenomic NGS on NP/OP samples was to “to aid in the recovery of enterovirus D68 genome sequences and detect potential co-infections from other viruses”.

b) In addition, although we reported the presence of bacterial reads in the NP/OP data, detected using SURPI/SNAP, we did not discuss this in the paper due to a number of reasons:
Our focus on looking for CSF (bacterial viral fungal and parasite) pathogens in the setting of AFM our clinical perspective that the presence of bacterial reads in NP/OP swabs from children most often reflect colonization / carriage and not infection (see,,,, etc. for papers on nasal carriage of Staphylococcus and Haemophilus in healthy children)
our treatment of the nucleic acid extracts with DNase to help reduce host background, which would bias accurate metagenomic interpretation and the accurate counting of bacterial and fungal reads since this procedure degrades bacterial / fungal genomic DNA due to high levels of respiratory tract colonization, NP/OP samples nearly always contain bacterial sequences that will constitute the majority of the non-human reads and beyond reporting read numbers, we did not comment on this in the paper.

c) Of note, the sample in question with Haemophilus influenzae (US/CA/09-871) came from a 2009 case of upper respiratory infection alone. This subject did not have either encephalitis or acute flaccid myelitis (Table 2 and results in Greninger et al.). The statements in the manuscript that “This patient was reported as having encephalitis and severe respiratory illness…” and “the sequence evidence here suggests that the patient might have had complications from H. influenzae-associated encephalitis or encephalomyelitis…” as well as the title of the manuscript are therefore incorrect.

2. The authors point out that there are residual human reads in the deposited data.

We acknowledge that we have been using SNAP, which is a global aligner, to extract out human reads for our SRA submissions.  This algorithm will miss human reads because of low-complexity sequences at the ends, residual adapters, etc.  We agree that we probably should have used a local aligner such as BLASTn at a low threshold level to more completely extract out human reads, or a k-mer approach such as Kraken. We appreciate the authors’ point on the importance of clearing human sequences from metagenomic data and will plan on resubmitting this more completely filtered data to the SRA in the near future.

The re-analysis presented here gives the erroneous impression that bacterial colonization has a strong association with the devastating acute flaccid myelitis (AFM) syndrome seen in our patients. As we note above, Haemophilus influenzae was found in a control patient with no neurological symptoms, and is thus not relevant to AFM. While MRSA colonization may possibly be a factor in AFM, it is extremely unlikely given its detection is a single case and the fact that MRSA nasal carriage in healthy children is well-described. We should also point out that bacterial cultures of cerebrospinal fluid (CSF) from all of the patients in the study were negative and that metagenomic next-generation sequencing did not reveal any evidence of bacterial infection in the central nervous system (CNS).


Optimal Seed Solver: Optimizing Seed Selection in Read Mapping


The core algorithm

A naive brute-force solution to find the optimal seeds of a read would be systematically iterating through all possible combinations of seeds. We start with selecting the first seed by instantiating all possible positions and lengths of the seed. On top of each position and length of the first seed, we instantiate all possible positions and lengths of the second seed that is sampled after (to the right-hand side of) the first seed.We repeat this process for the rest of the seeds until we have sampled all seeds. For each combination of seeds, we calculate the total seed frequency and find the minimum total seed frequency among all combinations.

The key problem in the brute-force solution above is that it examines a lot of obviously suboptimal combinations. For example,…..

…The above observation suggests that by summarizing the optimal solutions of partial reads under a smaller number of seeds, we can prune the search space of the optimal solution. Specifically, givenm (withm < x) seeds and a substring U that starts from the beginning of the read, only the optimal m seeds of U could be part of the optimal solution of the read. Any other suboptimal combinations of m seeds of U should be pruned.



Motivation: Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the potential of the mapper to select less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds.
Results: We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-bp read in O(x×L) operations on average and in O(x×L2) operations in the worst case. We compared OSS against four state-of-the-art seed selection schemes and observed that OSS provides a 3-fold reduction of average seed frequency over the best previous seed selection optimizations.

Tools or Algorithms, What is More Important in Bioinformatics? – The Answer Will Shock You

There are often discussions about whether the bioinformaticians should spend more time thinking about better algorithms or building more user-friendly implementations. The answer is neither, as a cancer benchmarking study found out. Quoting from ‘A Comprehensive Assessment of Somatic Mutation Calling in Cancer Genomes’ –

We also detected distinct clustering of different analysts. Even though analysts used similar combinations of software in their pipelines, very few similarities were detected in their calls. The combination of software is not as critical as how each piece of software is applied and what settings are applied. A slight correlation of true positive SSM calls of pipelines using MuTect and Strelka can be seen (Figure 4). Data analysis pipelines are constructed by integrating software from diverse sources. In many instances the approaches used in the different software and the assumptions made therein are not evident to the user and many parts are black boxes. In order to account for unknowns, pipeline developers resort to calibrating their pipelines against known results, for example from a genotyping experiment done on similar samples, and by using combinations of tools for the same process step, assuming that a result shared by two different approaches has a higher likelihood to be correct. Of note is that many of the pipelines apply multiple tools for the same process step in their pipeline and then use intersects to weight calls. This practice, together with good use of blacklists, results in best outputs. No best tools emerged from our benchmark and it is also clear that no strict best threshold settings exist. However, it is clear that how analysts use their pipeline is critical for the quality of their output.

A Polynomial Delay Algorithm for the Enumeration of Bubbles with Directed Graph

Today’s interesting paper is here

The problem of enumerating bubbles with length constraints in directed graphs arises in transcriptomics where the question is to identify all alternative splicing events present in a sample of mRNAs sequenced by RNA-seq.

We present a new algorithm for enumerating bubbles with length constraints in weighted directed graphs. This is the first polynomial delay algorithm for this problem and we show that in practice, it is faster than previous approaches.

This settles one of the main open questions from Sacomoto et al. (BMC Bioinform 13:5, 2012). Moreover, the new algorithm allows us to deal with larger instances and possibly detect longer alternative splicing events.

BGT: Efficient and Flexible Genotype Query Across Many Samples

Heng Li posted a new paper at arxiv on BGT format, which “stores the integer matrix as two positional BWTs (PBWTs), one for the lower bit and the other for the higher bit.”

The problem -

VCF/BCF (Danecek et al., 2011) is the primary format for storing and analyzing genotypes of multiple samples. It however has a few issues. Firstly, as a streaming format, VCF compresses all types of information together. Retrieving site annotations or the genotypes of a few samples usually requires to decode the genotypes of all samples, which is unnecessarily expensive. Secondly, VCF does not take advantage of linkage disequilibrium (LD), while using this information can dramatically improve compression ratio (Durbin, 2014). Thirdly, a VCF record is not clearly defined. Each record may consist of multiple alleles with each allele composed of multiple SNPs and INDELs. This ambiguity complicates annotations, query of alleles and integration of multiple data sets. At last, most existing VCF-based tools do not support expressive data query. We frequently need to write scripts for advanced queries, which costs both development and processing time. GQT (Layer et al., 2015) attempts to solve some of these issues. While it is very fast for selecting a subset of samples and for traversing all sites, it discards phasing, is inefficient for region query and is not compressed well. The observations of these limitations motivated us to develop BGT.

The solution -

Summary: BGT is a compact format, a fast command line tool and a simple web application for efficient and convenient query of whole-genome genotypes and frequencies across tens to hundreds of thousands of samples. On real data, it encodes the haplotypes of 32,488 samples across 39.2 million SNPs into a 7.4GB database and decodes a couple of hundred million genotypes per CPU second. The high performance enables real-time responses to complex queries.
Availability and implementation: here

There Will be no Grexit and Here Are the Consequences

Most people still cannot wrap their head around what is going on in Greece and how it will affect their lives. Let me summarize that for you, but not before you get an intro on the banking system.

Banking System Fact 1.

(Physical) cash is the core of the banking system. The electronic money people see in their bank accounts is a derivative of that cash. It may or may not be converted into physical cash, when you ask for it.

Banking System Fact 2.

The total amount of physical cash in the banking system is limited. The situation is similar to gold in the 1930s. Yes, it is true that the governments can print banknotes, but they are restricted by the banks to not do so.

Banking System Fact 3.

The worldwide banking system is going to blow up due to lack of collateral and the world is going cash only. You can see that process by following the events in Greece over the last few years. In a cash-only regime, very little electronic cash will be accepted anywhere, because a person/business accepting electronic money will be taking the risk of being locked within a bank holiday somewhere.


Based on above the three facts, can you tell me the people of which country in Europe have the highest amount of cash right now? Then after the blow up, who will be the most ahead?

In 1930s analogy, Greeks ran away with Europe’s ‘gold’ and that was the main purpose of this drip-drip-drip bank run over the last six months funded by ECB. Do not underestimate Varoufakis.

[Photo via (h/t@acting-man blog)]


In the meanwhile, enjoy a number of great cartoons on Greece here.

Diploid Human Genome Assembly using Pacbio Technology

We are surprised that the researchers are using technology from a company that was supposed to go out of business due to competition from Oxford Nanopore. A new Nature Method paper reports –

Assembly and diploid architecture of an individual human genome via single-molecule technologies

We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.

A condensed version of major claims in the paper is available from this post in Pacbio blog.

UCSC’s David Haussler Wins Award through Deceptive Claims, Gene Myers Loses

I came across an article titled ‘A visionary, a genius, and the human genome‘ and checked what the hoopla was all about. What I found was quite shocking.


It was May 2000, the race to sequence the human genome was on, and UC Santa Cruz Biomolecular Engineering Professor David Haussler was worried.

A private firm named Celera Genomics was beating a path to the prize with a big budget and what was reported to be the most powerful computer cluster in civilian use.

Meanwhile, an international consortium of scientists—which Haussler had only recently been invited to join—was lagging behind.

Haussler, a tall man with a penchant for Hawaiian shirts, had managed to wrangle 100 Dell Pentium III processor workstations for the project. About 30 had been purchased but others had been intended for student use—until Dean of Engineering Patrick Mantey and Chancellor M.R.C. Greenwood agreed Haussler could “break in” the machines before they went into classrooms.

Each of the computers was less powerful than one of today’s smart phones, but, nevertheless, the UC Santa Cruz group was able to link them together for parallel processing, creating a makeshift “supercomputer” for the project.

What happened next was the stuff of movies: a genius and a brilliant computer scientist at an upstart university defying the odds to become the first in the world to assemble the DNA pieces of the human genome.

Why does UCSC bring up human genome project after 15 years and write a revisionist history? It is because ‘Genomics Institute director David Haussler awarded prestigious Dan David Prize‘.

Joseph Klafter, president of Tel Aviv University and chair of the Dan David Prize Board, said in a video announcement of the prize laureates, “Professor David Haussler presented the first draft of the human genome sequence and developed the Genome Browser used worldwide for interpreting genome sequences. The browser includes tools for identifying and comparing genes, for accessing information on gene structure, function, and regulation, and for revealing gene-disease relationships. Dr. Haussler introduced machine learning techniques to bioinformatics, becoming a central paradigm in the field.”

Based on the literature, Haussler did not write any genome assembler any time before May 2000 and Haussler did not contribute to genome assembly field any time after July 2000. Gene Myers (an author of BLAST and also the inventor of suffix arrays) and his student Kececioglu developed the shotgun assembly algorithm in 1991 (Kececioglu’s thesis) and published about it in many papers since then. When Myers wrote a paper in 1997 explaining that the same algorithm could be used for human genome assembly, the clowns from government-backed human genome project protested. That was the main reason of Celera’s existence. We covered all that history in the following two blog posts.

An Opinionated History of Genome Assembly Algorithms – (i)

An Opinionated History of Genome Assembly Algorithms – (ii)

At Celera, Myers developed an algorithm that had been used for countless assemblies since then (including Drosophila, human and mouse). Even today, Jason Chin of Pacbio uses Celera assembler for assembling long-reads. Myers published a review paper in April 2000 describing the software along with the Drosophila genome paper that came out in the same issue.

Anyone, who has done a genome assembly and read through those early papers, clearly understands that Haussler’s claims of ‘doing the first assembly of human genome’ is utterly deceptive. The government-backed biologists kept repeating the claim to save their own back, but Haussler should know better. Therefore, if he accepts an award for ‘doing the first assembly of human genome’ with a straight face, he is also due for Manolis Kellis award for integrity sooner or later.