Personalized Cancer Screening Will Lead to Many False Diagnoses of Blood Cancer

Here is an interesting paper that calls into question many commonly accepted notions about GWAS and ‘personalized medicine’. The authors studied somatic mutations in blood samples of cancer patients not known to have leukemia or lymphoma, and found many of them to carry mutations linked with blood cancer.


Somatic mutations

It is well known that our genome is not static: the cells in each individual's body accumulate mutations over their lifetime. Ken Weiss often points this out in his blog posts when discussing the limitations of genome-based comparative studies for predicting common traits or disease risks. From his recent series –

What if Rev Jenyns had agreed? Part I. Would evolutionary theory be different?

What if Rev Jenyns had agreed? Part II. Would evolutionary theory be different from a population perspective?

What if Rev Jenyns had agreed? Part III. ‘Group’ selection in individuals, too.

But individuals are populations too

Let’s ask something very simple: What is your ‘genotype’? You began life as a single fertilized egg with two instances of human genomes, one inherited from each parent (here, we’ll ignore the slight complication of mitochondrial DNA). Two sets of chromosomes. But that was you then, not as you are now. Now, you’re a mix of countless billions of cells. They’re countless in several ways. First, cells in most of your tissues divide and produce two daughter cells, in processes that continue from fertilization to death. Second, cells die. Third, mutations occur so that each cell division introduces numerous new DNA changes in the daughter cells. These somatic (body cell) mutations don’t pass to the next generation (unless they occur in the germline) but they do affect the cells in which they are found.

But how do we determine your genotype? This is usually done from thousands or millions of cells—say, by sequencing DNA extracted from a blood sample or cheek swab. So what is usually sequenced is an aggregate of millions of instances of each genome segment, among which there is variation. The resulting analysis picks up, essentially, the most common nucleotides at each position. This is what is then called your genotype and the assumption is that it represents your nature, that is, all your cells that in aggregate make you what you are.

In fact, however, you are not just a member of a population of different competing individuals each with their inherited genotypes. In every meaningful sense of the word, each person, too, is a population of genomes. A person’s cells live and/or compete with each other in a Darwinian sense, and his/her body and organs and physiology are the net result of this internal variation, in the same sense that there is an average stature or blood pressure among individuals in a population.


Is somatic variation important?

An individual is a group, or population, of differing cells. In terms of the contribution of genetic variation among those cells, our knowledge is incomplete, to say the least. From a given variant’s point of view (and here we ignore the very challenging aspect of environmental effects), there may be some average risk – that is, some average phenotype – among all sampled individuals with that variant in their sequenced genome. But somatically acquired variation will modify that variant’s effects, and generally we don’t yet know how to take that into account, so it represents a source of statistical noise, or variance, around our predictions. If the variant’s risk is 5%, does that mean that 5% of carriers are at 100% risk and the rest at zero? Or that all are at 5% risk? How can we tell? Currently we have little way to tell, and manifestly even less interest in this problem.
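The two scenarios in that question really are indistinguishable at the aggregate level: under either model the number of affected carriers follows the same Binomial(N, 0.05) distribution. A toy simulation (entirely hypothetical numbers, not from any study) makes this concrete:

```python
import random

random.seed(42)
N = 100_000  # hypothetical number of carriers of the variant

# Model A: every carrier faces the same uniform 5% risk.
cases_a = sum(1 for _ in range(N) if random.random() < 0.05)

# Model B: a random 5% of carriers are at 100% risk, the rest at zero.
high_risk = set(random.sample(range(N), k=N // 20))
cases_b = sum(1 for i in range(N) if i in high_risk)

# Both models predict ~5,000 cases; case counts alone cannot separate them.
print(cases_a, cases_b)
```

Separating the two models would require information beyond the aggregate risk, e.g. repeated observations within families or within individuals over time.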

Cancer is a good, long-studied example of the potentially devastating nature of somatic variation, because there is what I’ve called ‘phenotype amplification’: a cell that has inherited (from the person’s parents or the cell’s somatic ancestors) a carcinogenic genotype will not in itself be harmful, but it will divide unconstrained so that it becomes noticeable at the level of the organism. Most somatic mutations don’t lead to uncontrolled cell proliferation, but they can be important in more subtle ways that are very hard to assess at present. But we do know something about them.

Evolution is a process of accumulation of variation over time. Sequences acquire new variants by mutation in a way that generates a hierarchical relationship, a tree of sequence variation that reflects the time order in which each variant first arose. Older variants that are still around are typically more common than newer ones. This is how the individual genomes inherited by members of a population came to be, and it is part of the reason that a group perspective can be an important but neglected aspect of our desire to relate genotypes to traits, as discussed yesterday. Older variants are more common and easier to find, but are unlikely to be too harmful, or they would not still be here. Rarer variants are very numerous in our huge, recently expanded human population. They can have strong effects, but their rarity makes them hard to analyze with our current statistical methods.

However, the same sort of hierarchy arises during life as somatic mutations occur in different cells at different times in individual people. Mutations arising early in embryonic development will be represented in more descendant cells than recent variants, perhaps even in all the cells of some descendant organ system. But because recent variants arise when each organ already contains many cells, the organ may carry a large number of very rare, but collectively important, variants.
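The arithmetic behind this hierarchy can be sketched with an idealized binary cell lineage (a deliberate simplification; real tissues also lose cells and divide asynchronously): a mutation arising after division round d is carried by every descendant of the mutant cell, so its frequency among the final cells is 1/2^d.

```python
D = 20                    # hypothetical number of synchronous division rounds
final_cells = 2 ** D      # ~1 million cells in this toy lineage

for d in (1, 5, 10, 20):
    carriers = 2 ** (D - d)   # cells descending from the first mutant cell
    print(f"mutation at division {d:2d}: {carriers:7d} cells "
          f"({carriers / final_cells:.4%} of the tissue)")
```

Early mutations are common within the body (and easy to detect); late ones are individually rare but vastly more numerous, mirroring the population-level pattern described above.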

The mix of variants, their relative frequencies, and the distribution of their resulting effects are thus a population rather than an individual phenomenon, both in populations and in individuals. Reductionist approaches done well are not ‘wrong’; they tell us what can be told by treating individuals as single genotypes and enumerating them to find associations. But the reductionist approach is only one way to consider the causal nature of life.

Our society likes to enumerate things and characterize their individual effects. Group selection is controversial in the sense of explaining altruism, and some versions of group selection as an evolutionary theory have well-demonstrated failings. But properly considered, groups are real entities that are important in evolution, and that helps account for the complexity we encounter when we force hyper-reductionistic, individual thinking to the exclusion of group perspectives. The same is true of the group nature of individuals’ genotypes.

The authors of the Nature Medicine paper have been looking into somatic mutations in cancer patients for several years. Here is one of their earlier papers on the topic.

Somatic mutations affect key pathways in lung adenocarcinoma

Determining the genetic basis of cancer requires comprehensive analyses of large collections of histopathologically well-classified primary tumours. Here we report the results of a collaborative study to discover somatic mutations in 188 human lung adenocarcinomas. DNA sequencing of 623 genes with known or potential relationships to cancer revealed more than 1,000 somatic mutations across the samples. Our analysis identified 26 genes that are mutated at significantly high frequencies and thus are probably involved in carcinogenesis. The frequently mutated genes include tyrosine kinases, among them the EGFR homologue ERBB4; multiple ephrin receptor genes, notably EPHA3; vascular endothelial growth factor receptor KDR; and NTRK genes. These data provide evidence of somatic mutations in primary lung adenocarcinoma for several tumour suppressor genes involved in other cancers—including NF1, APC, RB1 and ATM—and for sequence changes in PTPRD as well as the frequently deleted gene LRP1B. The observed mutational profiles correlate with clinical features, smoking status and DNA repair defects. These results are reinforced by data integration including single nucleotide polymorphism array and gene expression array. Our findings shed further light on several important signalling pathways involved in lung adenocarcinoma, and suggest new molecular targets for treatment.


Personalized Medicine

Given that somatic mutations exist, can we not continually sequence the genomes or exomes of various tissues of individuals and detect diseases early? The new paper suggests that such an approach would lead to a huge number of false-positive cancer diagnoses.

Blood Cancer-Related Mutations Commonly Found in Older Individuals

Significant differences in the presentation of mutations in leukemia- or lymphoma-associated genes were observed between age groups. Mutations were noted in 0.88% of individuals aged 40-49 (n = 341), 0.99% of individuals aged 50-59 (n = 707), 1.83% of individuals aged 60-69 (n = 767), 5.26% of individuals aged 70-79 (n = 475), and 6.06% of individuals aged 80-89 (n = 132).

However, this analysis may be underestimating the proportion of individuals with such mutations, the authors wrote, as they only identified small mutations. There remains a need to identify large structural variations or the insertions and deletions of chunks of genetic material associated with hematologic malignancies.

The authors on the study noted that a clinician must be careful in using blood as a surrogate reference. There is a risk of blood-specific variants being mistaken as germline variants in individuals without hematologic malignancies. Further, “germline alleles in cancer samples could be mistaken as tumor-specific variants when comparing to blood samples. Lastly, connections were made between mosaic PPM1D mutations in lymphocytes and the predisposition to breast and ovarian cancers,” the authors wrote.

The authors stated that testing for mutations associated with leukemia and lymphoma would not predict an individual’s risk of a hematologic malignancy.

“We would not want anyone to think they should be screened for these mutations to understand their risk of leukemia or lymphoma,” Timothy Ley, MD, co-author and professor from Washington University said in a statement. “The ability to understand how mutations in these genes increase a person’s risk of blood cancers is a long way off, and genetic testing would be of no benefit at this time.”

And that is just one type of disease. Imagine a healthy 40-year-old going through similar genetic screening for hundreds of diseases, with each disease carrying a 1% false-positive rate (5% for a 70-year-old). Promoters of personalized genetic testing need to think through these issues before making their methods available in clinics.
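The arithmetic behind that warning is worth spelling out. A minimal sketch, using a hypothetical 100-disease panel and the per-test rates quoted above:

```python
# Probability of at least one false positive when one healthy person is
# screened for many independent conditions.
def p_any_false_positive(n_tests: int, fp_rate: float) -> float:
    return 1.0 - (1.0 - fp_rate) ** n_tests

# Hypothetical 100-disease panel, per-test rates loosely from the post above.
print(p_any_false_positive(100, 0.01))  # 40-year-old: ~63% chance of a scare
print(p_any_false_positive(100, 0.05))  # 70-year-old: ~99% chance
```

Even modest per-test error rates compound quickly once a screen covers many conditions, which is exactly the concern the authors raise.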

Cuckoo Filter to Replace Bloom Filter?

@rozovr forwarded the link to a new paper that is worth looking into.

Cuckoo Filter: Practically Better Than Bloom

In many networking systems, Bloom filters are used for high speed set membership tests. They permit a small fraction of false positive answers with very good space efficiency. However, they do not permit deletion of items from the set, and previous attempts to extend “standard” Bloom filters to support deletion all degrade either space or performance. We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.

The name derives from their use of the cuckoo hashing scheme. Compared to a Bloom filter –

We propose the cuckoo filter, a practical data structure that provides four major advantages.

1. It supports adding and removing items dynamically;

2. It provides higher lookup performance than traditional Bloom filters, even when close to full (e.g., 95% space utilized);

3. It is easier to implement than alternatives such as the quotient filter; and

4. It uses less space than Bloom filters in many practical applications, if the target false positive rate is less than 3%.
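To make the mechanics concrete, here is a minimal, illustrative Python sketch of the partial-key cuckoo hashing idea the paper builds on. The bucket count, bucket size, 8-bit fingerprints, and hash mixing below are arbitrary illustration choices, not the authors' implementation:

```python
import random

class CuckooFilter:
    """Toy sketch of a cuckoo filter (partial-key cuckoo hashing)."""

    def __init__(self, n_buckets=1024, bucket_size=4, max_kicks=500):
        self.n = n_buckets            # must be a power of two for the XOR trick
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _fingerprint(self, item):
        return (hash(("fp", item)) & 0xFF) or 1   # 8-bit, never zero

    def _indices(self, item):
        f = self._fingerprint(item)
        i1 = hash(item) % self.n
        i2 = (i1 ^ hash(("idx", f))) % self.n     # derivable from i1 and f alone
        return f, i1, i2

    def insert(self, item):
        f, i1, i2 = self._indices(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(f)
                return True
        i = random.choice((i1, i2))               # both full: start evicting
        for _ in range(self.max_kicks):
            j = random.randrange(self.bucket_size)
            f, self.buckets[i][j] = self.buckets[i][j], f   # swap in, kick out
            i = (i ^ hash(("idx", f))) % self.n   # evictee's alternate bucket
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(f)
                return True
        return False                              # table is too full

    def contains(self, item):
        f, i1, i2 = self._indices(item)
        return f in self.buckets[i1] or f in self.buckets[i2]

    def delete(self, item):
        f, i1, i2 = self._indices(item)
        for i in (i1, i2):
            if f in self.buckets[i]:
                self.buckets[i].remove(f)
                return True
        return False
```

The key trick is that an entry's alternate bucket is computed from its current bucket and its fingerprint alone (i2 = i1 XOR hash(fingerprint), with a power-of-two table size), so stored fingerprints can be relocated during eviction without access to the original items – which is what allows deletion, unlike a standard Bloom filter.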

For more information, check their Google Plus thread and Reddit thread.

Will it replace Bloom filter based k-mer counting methods? Pall Melsted offers his perspective –


Tragedy of the Day – New ‘Revealing’ Human Transcriptome Paper at Biorxiv

Someone forwarded me the link to a new human transcriptome paper, and as an author of the first genome-wide human transcriptome paper exactly 10 years back, I decided to check how far the field has advanced over the decade. What I found is quite shocking, and it is abundantly clear that these ENCODE-clowns have not learned any lessons from their public humiliation.

The paper examines the transcriptomes of ~400 female twin pairs (~800 individuals), throws all kinds of statistical analysis tools at the data, and reports absolutely nothing interesting. No new pathways, no mechanisms, nothing. Based on that summary, I thought the title would be ‘comparative analysis of transcriptomes of 800 UK twins shows nothing’. Instead the authors chose ‘transcriptome sequencing reveals widespread gene-gene and gene-environment interactions’. Oh, please !!

The BS starts right from the abstract. The last statement of the abstract says – ‘We uncover large GxG and GxE effects on gene expression and likely complex phenotypes that currently remain elusive’. We pondered the latter part of the sentence and could find absolutely no meaning whatsoever. Do the authors mean ‘we uncover complex phenotypes that currently remain elusive’? How can you uncover something that you do not uncover anywhere in the paper? The best explanation is that the authors wanted to include the words ‘complex phenotype’, because that is what gets them money these days. Otherwise, they might as well say ‘we uncover likely the hand of God’ and be just as speculative.

The main text starts with –

Gene expression is a cellular phenotype that informs about the functional state of the cell. Gene expression by itself is a complex trait that depends on genetic and environmental causes.

In plain English, they are saying ‘gene expression’ is now the new definition of function/phenotype, and with that broadened definition of phenotype, here comes our speculative paper. ENCODE 3 is reborn !! That is far from the original intention of such research projects, where they were supposed to explain real phenotypes, not redefine phenotype. However, as Sydney Brenner explained, the inverse problem of physiology cannot be solved for mathematical reasons. Hence comes the redefinition and hiding behind ‘complex phenotype’ metaphors.

A high point of the paper is to go through a long list of genes, find one or two well-known members through a cherry-picking process, and then attach some kind of meaningless biological story. The authors picked the Epstein-Barr virus induced 3 gene (EBI3).

One of the top hits in LCLs is the Epstein-Barr virus induced 3 gene (EBI3) (Figure 3). That means that ASE at the EBI3 gene depends on the interaction of cis genetic variants and an environmental factor that, likely, in this case is the transformation process of the B cells with EBV.

By now, we have learned that all GWAS-type papers end with the obligatory statement that the authors could not uncover something because their sample size was small. That statement then becomes the first sentence of the follow-up grant application asking for more money to do the study with 4,000 twins. This paper is no exception.

We found an example of GxE interaction on gene expression that has been widely described in the literature (Adiponectin) supporting the validity of our approach. However, the limitation on power due to the sample size prevented us to discover specific associations in most other cases.

Not that lack of funding was any obstacle for their work.

Acknowledgements: This work has been funded by the EU FP7 grant EuroBATS (No. 259749) which also supports AAB, AB, MND, AV, TDS. AAB is also supported by a grant from the South-Eastern Norway Health Authority (No. 2011060). RD is supported by the Wellcome Trust (No. 098051). The Louis-Jeantet Foundation, Swiss National Science Foundation, European Research Council and the NIH-NIMH GTEx grant support ETD. TS is an NIHR senior Investigator and holder of an ERC Advanced Principal Investigator award. JBR and HZ are supported by the Canadian Institutes of Health Research, Fonds de Recherche Sante du Quebec, and the Quebec Consortium for Drug Discovery. Most computations were performed at the Vital-IT Center for high-performance computing of the SIB Swiss Institute of Bioinformatics. The TwinsUK study was funded by the Wellcome Trust; EC FP7 (2007-2013) and National Institute of Health Research (NIHR). SNP Genotyping was performed by The Wellcome Trust Sanger Institute and National Eye Institute via NIH/CIDR.

In summary, this paper is so meaningless that we suspect it will be channeled into Nature through Magdalena Skipper.


Speaking of constructive criticism, could the same research have been done differently? We note that typical biological problems are solved based on three independent sources of data –

1. Paleontology (fossil records), which gives the historical perspective,

2. Phenotypes observed from cell biology and developmental biology,

3. Genetics, which provides the informational system.

In the case of human biology, family history gives information equivalent to paleontology.

So, a researcher interested in identical twins needs first to identify a set of observables, based on either family history or looking at the cells under the microscope, and then see whether genetic data can explain those observations. If there is no observable to explain, maybe it is time to move on to the next problem.

Instead these authors start by collecting genetic data without any observable to explain and then go on a fishing expedition. When that fails, they redefine gene expression as a ‘complex phenotype’ and write a boring paper that explains nothing.

FPGA-accelerated Bioinformatics at #ASHG – Dragen Aligner from Edico Genome

We thank Max Shen for his generous contribution to support our blog.

We covered several CPU and GPU-based algorithms in “ASHG/GA4GH Special – A Recap of Interesting NGS Algorithms of the Year”, but no FPGA-accelerated one. Readers interested in FPGAs will find the following ASHG abstract interesting. At the bottom, we discuss the pros and cons of using these three hardware alternatives in bioinformatics.

Enhanced fetal aneuploidy detection using hardware accelerated alignment

Session Title: Bioinformatics and Genomic Technology Session Type: Poster
Session Location: Exhibit Hall, Ground Level, Convention Center Session Time: Mon 10:00AM-4:30PM

Program Number: 1675M Presentation Time: Mon, Oct 20, 2014, 2:00PM-3:00PM
Keywords: Bioinformatics and Genomic Technology, KW008 – bioinformatics, KW146 – prenatal diagnosis, KW031 – computational tools, KW103 – massively parallel sequencing, KW023 – chromosomal abnormalities

M. Sykes1, C. Roddey2, M. Ruehle2, R. McMillen2, P. Whitley1
1) Sequenom Inc., San Diego, CA; 2) Edico Genome Inc., La Jolla, CA.

Noninvasive prenatal testing can be performed by massively parallel sequencing of DNA from maternal plasma. This method has been shown effective in the detection of fetal aneuploidies of chromosomes 13, 18, 21 and the sex chromosomes. Accurate classification of these aneuploidies requires, in part, alignment of sequencing reads to the human genome, calculation of chromosome fractions based on these alignments and calculation of z-scores for each chromosome based on these fractions. The success of these steps relies upon the choice of aligner and algorithm used to determine the chromosome fractions.

Here we present reclassification of a dataset of 1269 samples previously analyzed using bowtie 2 as the aligner. In this study alignments are generated by the DRAGEN processor, a hardware-accelerated sequencing analysis system developed by Edico Genome. We report systematic differences between the two aligners but equivalent performance in terms of chromosome fraction variability and thus chromosome quantification.

Both the bowtie 2 and DRAGEN based analyses successfully identified all known T13, T18 and T21 cases in the dataset. The sensitivity and specificity were both > 99.9% in each classification. At the same time the DRAGEN system provides speed increases of greater than thirty-fold relative to bowtie 2 running with 6 threads on a 3.5 GHz Xeon CPU, allowing a single computer to replace the efforts of a small cluster.

These results demonstrate that the classification algorithm for fetal aneuploidy is robust and resistant to localized changes in the alignment profile. Furthermore the DRAGEN system provides equivalent performance to bowtie 2 with a significant increase in speed.

A Comparison of CPU, GPU and FPGA


CPUs are generic processors that come with a small set of hardware-accelerated commands (e.g. ‘add two numbers’, ‘multiply two numbers’, ‘compare two numbers’, ‘jump to a different statement’, etc.). Programmers write code in a high-level language such as C/C++, and a compiler then translates the code into the language that the CPU understands. For example, if someone wants to add five numbers, the compiler will instruct the processor to use the ‘add two numbers’ command multiple times.

Advantage – Flexibility in coding. If the biological problem changes from the original one, the software code can be tweaked slightly to solve an entirely new problem. The compiler takes care of making sure the tweaked software code is understood by the hardware.

Disadvantage – Limited scalability due to memory bandwidth.

The Future of Computers – Multicore and the Memory Wall

The scalability problem is not always inherent in the architecture; at times it comes from programmers’ lack of knowledge about the hardware architecture. After all, the promise of software programming is to let programmers remain oblivious to what is going on inside the chip. However, some programmers do know quite a bit about the chip, and use their knowledge to get around the memory bandwidth issue. Here is a beautiful example from Gene Myers –

In DALIGN Paper, Gene Myers Delivers a Major Blow to His Biggest Competitor


The computer display is one specific component where the issue of processing speed has been most critical. Users were very sensitive to slow displays and were always ready to pay top dollar for a screen that showed images quickly. GPUs were born from that demand. They are highly parallelized processing chips, except that for most of their early history, GPUs were used only for processing graphics-related instructions.

Advantage – Better scalability than CPUs for a large number of parallel executions.

Disadvantage –

(i) All instructions executing at a time have to be of the same type.

(ii) It is important to have some knowledge of the hardware. The companies are trying to remove this restriction by coming up with OpenCL and other GPU-specific ‘compilers’, but it is not possible to take full advantage of the hardware without knowing how it works.


FPGAs are hardware chips that can be reprogrammed to solve any specific problem.

Advantage – Speed.

Disadvantage – Lack of flexibility, and the cost associated with this lack of flexibility.

Historically, the bioinformatics land has been a graveyard of dead FPGA companies.

There are two problems –
(i) The algorithmic landscape changes too fast,
(ii) The users want to see output from specific programs (‘give me results matching BLAST or HMMER’) rather than correct results following the same algorithm.

Getting back to the current abstract, more details about the technology can be found here. Based on their numbers, the cost of each accelerated server and its associated expenses is around $3.36M over 4 years. Are we doing the calculation right? The extra computing cost of each genome is $1,130 - $1,000 = $130 (see below). For 72,000 genomes, the total computing cost is $9.36M using CPU-based technology (i.e. 50 Dell servers). With DRAGEN, they claim the cost savings to be $6M, so the total cost of their server = $3.36M.

Earlier this year Illumina announced their HiSeq X Ten, which is a cluster of 10 HiSeq X instruments capable of sequencing up to 18,000 whole human genomes each year with continuous operation.

The HiSeq X Ten has a run cycle of ~3 days and produces ~150 genomes each run cycle. This means that the data generated must be analyzed within 3 days or a backlog will occur. Simple math thus provides a target of ~28 minutes for the completion of the mapping, aligning, sorting, deduplication and variant calling of each genome.

Running the industry standard BWA+GATK analysis pipeline to perform this analysis on a reasonably high-end (Dual Intel Xeon E5-2697v2 CPU – 12 core, 2.7 GHz with 96 GB DRAM) compute server takes ~24 hours per genome. To achieve the required throughput of 150 genomes every three days, at least 50 of these servers are required.

With DRAGEN’s run time for the full pipeline being well under 28 minutes, only a single card is required to handle the data produced by a full HiSeq X Ten system running at maximum capacity!

When calculating the costs associated with the legacy compute server infrastructure required to analyze and store the data produced by the HiSeq X Ten, we use the same method that Illumina themselves use to calculate their “$1,000 Genome.” Equipment costs are amortized over 4 years and staff and overheads are included.

By taking into account standard pricing from Dell for the 50 compute servers, industry standard costs for associated IT professionals, rack space and power, as well as AWS pricing for upload and storage of the genomic data over a 4 year period, the $1,000 Genome becomes a $1,130 Genome.

By comparison, DRAGEN significantly reduces the compute cost over the same period while also removing the need for IT professionals and lowering power and rack space requirements. In addition, since DRAGEN implements CRAM-based compression within its pipeline to transparently compress the large amounts of data to be uploaded and stored for later re-analysis, these costs are also significantly lower. When combining all of these savings over the 72,000 genomes sequenced during the 4 year period, DRAGEN provides cost savings of ~$6 million!
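The quoted figures can be sanity-checked with a few lines of arithmetic; all inputs below come straight from the text above:

```python
# HiSeq X Ten throughput target
genomes_per_cycle = 150
cycle_minutes = 3 * 24 * 60
minutes_per_genome = cycle_minutes / genomes_per_cycle
print(minutes_per_genome)                 # 28.8 minutes, the "~28 minute" target

# Compute-cost side of the comparison
genomes_over_4_years = 18_000 * 4         # 72,000 genomes at full capacity
extra_cost_per_genome = 1_130 - 1_000     # "$1,130 Genome" minus "$1,000 Genome"
cpu_compute_cost = genomes_over_4_years * extra_cost_per_genome
print(cpu_compute_cost)                   # $9,360,000 for the 50-server setup

# The claimed ~$6M savings implies the DRAGEN alternative costs about:
print(cpu_compute_cost - 6_000_000)       # $3,360,000 over the same 4 years
```

So the numbers quoted by Edico are internally consistent, which is what our back-of-the-envelope estimate above also suggests.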

Renowned Human Geneticist Michael Hammer Shamed, Joe Pickrell Continues to Support Bad Science

In a world of universal deceit, telling the truth is a revolutionary act – George Orwell

The same spirit applies to the world of population genetics. Needless to say, nobody likes a revolution, least of all Joe Pickrell.


We decided to check what the hoopla is about and found a rebuttal paper from Eran Elhaik, Dan Graur and collaborators newly posted on arXiv. We covered previous episodes of this saga in some detail several months back, and readers can check “Dan Graur’s Latest Bombshell and Implications for Jewish Genetics”. Briefly, Michael Hammer’s group published a paper in 2013 (Mendez et al.) in which they claimed to have rewritten the textbook on human evolution. Then Elhaik et al. redid the analysis and found it to be full of incorrect assumptions.

The most interesting part of the latest rebuttal is its last section, where an author (Thomas Krahn) of the original Mendez et al. paper acknowledged by email that he and the other authors were well aware that the newsworthy claim of the paper could be an artifact of incorrect assumptions made in the analysis. What else do you need?

To learn what Dr. Krahn thinks about the time estimates in Mendez et al. (2013), the missing exchange is published below with Dr. Thomas Krahn’s kind permission.

TK: “While I agree that the outrageous time estimates for A00 from Fernando [Mendez] need an urgent correction and discussion, I don’t think that your argumentation yields enough weight to question the fact that Perry’s Y does in fact represent an ‘extremely ancient’ haplogroup.”

EE: “I am just a bit skeptical about some of their statements in the paper, that the A00 predates human and the calculation of the Y tree in their Figure 1, that doesn’t sound right.”

TK: “Yep, we were extensively discussing this issue. My numbers were more like [your] 200ky for the A00 split but Fernando [Mendez] insisted using autosomal SNP frequency data for dating, which I thought is absolute nonsense for the Y mutation mechanism. Anyhow, it was his job to do the dating maths.”

Lately it has become extremely difficult to correct the scientific record, and we could write volumes on the topic based on our experience with the junk science promoted by the positivity lady (check Tragedy of the Day: PNAS Got Duped by Positivity Lady !!, Our New PNAS Paper Debunks the Genomics of Positivity). Her positivity ‘science’ is now part of Arianna Huffington’s latest book, available in every bookstore and public library, even though we have shown that the entire paper is garbage. People like Joe Pickrell pose a particular challenge to those trying to highlight these examples of extremely bad science, and we experienced similar resistance to our rebuttal of the positivity paper from others in the field of psychology. In the case of Mendez et al., when the original rebuttal paper of Elhaik et al. came out in January, Joe Pickrell was not happy. He is still not happy with Graur and Elhaik, even though it turns out that the authors of the original paper deliberately tweaked their analysis. This must be the post-modern way of finding truth.

There is no crime, absolutely none, that cannot be condoned when ‘our’ side commits it. – George Orwell

Readers may also find the blog post by Elhaik on the arxiv paper helpful.

The flabbergasting science of Fernando Mendez, Michael Hammer, and co-authors

When I published The missing link of Jewish European Ancestry1, currently ranked #2 most-read paper in GBE, Michael Hammer called me:

“outlier folks… who have a minority view that’s not supported scientifically. I think the arguments they make are pretty weak and stretching what we know.”

Since Hammer and I consider science to be two different things, I took it as a compliment. Little did I know back then that my study rocked Hammer’s world, because in the very same year he and Fernando Mendez published their own outlier paper2 dating the most ancient Y chromosome (A00) to over 338,000 years ago, almost twice the age of Anatomically Modern Humans according to fossil evidence. Mendez et al.2 were not at all coy about their most abnormal finding and proudly announced that:

“In contrast to the estimates of all previous studies, our estimate of the Y chromosome TMRCA predates both that of known mtDNA lineages and of the likely time of origin of AMHs on the basis of current fossil evidence.”2

Upon reading their paper, it became obvious to us that Hammer missed (once again) the whole point of doing science. For us, science is a systematic approach that produces accurate and replicable results to explain the universe rather than falsify them. In our critique we have shown3

“that consistently throughout their examination, Mendez et al2 have chosen the assumptions, approximations, numerical miscalculations and data manipulation that inflated the final TMRCA estimate.”

This may not come as a complete shock to the readers of this blog and those of us intimately familiar with Hammer’s previous studies on the Jewish people and particularly the Kohanim4-6, which were challenged by myself and other authors7,8.

As often happens, our critique was very well received and got a media coverage that outshined the original fabricated piece. Even The Holy See commented on that through his speaker, which was quiet surprising because he rarely comments on scientific papers (the comment was primarily meant to negate FOX’s allegation as if we had claimed that we dated the biblical “Adam.” FOX news are somewhat illiterate when it comes to science). A short video we published on YouTube was viewed almost 10,000 times in three days and altogether ~15,000 times! As far as we know, this is the first video ever produced to explain a rather dense scientific paper in plain terms. You can watch it below:

With their mischiefs now becoming public knowledge, it was not unexpected that Hammer and Mendez would response. Amazingly, they remained determined to defend their right to falsify the results even after 55% of their co-authors2 abandoned their sinking flotilla. Again, to those of us familiar with their previous work, this came as no surprise.

What did surprise us was that they also decided to publish “all” of our personal correspondence without permission while neglecting some of the most interesting emails that revealed their practices. Unfortunately, the journal censored these and other parts of our reply.

Mendez et al. (2014) rebuttal along with our vasectomized response now appear in EJHG, but since Fernando Mendez could not restrain himself and published a rebuttal online, we published our original and complete response in arXiv, including the forbidden correspondences, with permission by their author, of course. Among else, you may find the following correspondence between myself (EE) and the second co-author of the first paper2 who is also the original discoverer of the A00 haplotype, Thomas Krahn (TK) back in May 2013:

TK: “While I agree that the outrageous time estimates for A00 from Fernando [Mendez] need an urgent correction and discussion, I don’t think that your argumentation yields enough weight to question the fact that Perry’s Y does in fact represent an ‘extremely ancient’ haplogroup.”

EE: “I am just a bit skeptical about some of their statements in the paper, that the A00 predates human and the calculation of the Y tree in their Figure 1, that doesn’t sound right.”

TK: “Yep, we were extensively discussing this issue. My numbers were more like [your] 200ky for the A00 split but Fernando [Mendez] insisted using autosomal SNP frequency data for dating, which I thought is absolute nonsense for the Y mutation mechanism. Anyhow, it was his job to do the dating maths.”

The readers are invited to guess how many of co-authors would remain on Mendez’s shipwreck for their next duly rebuttal. My guess is that their next paper will not be cited as “Mendez et al.” anymore.


1 Elhaik, E. The Missing Link of Jewish European Ancestry: Contrasting the Rhineland and the Khazarian Hypotheses. Genome Biology and Evolution 5, 61-74, doi:10.1093/gbe/evs119 (2013).

2 Mendez, F. L. et al. An african american paternal lineage adds an extremely ancient root to the human y chromosome phylogenetic tree. Am. J. Hum. Genet. 92, 454-459, doi:S0002-9297(13)00073-6 [pii]

10.1016/j.ajhg.2013.02.002 (2013).

3 Elhaik, E., Tatarinova, T. V., Klyosov, A. A. & Graur, D. The ‘extremely ancient’ chromosome that isn’t: a forensic bioinformatic investigation of Albert Perry’s X-degenerate portion of the Y chromosome. Eur. J. Hum. Genet., doi:10.1038/ejhg.2013.303 (2014).

4 Behar, D. M. et al. The genome-wide structure of the Jewish people. Nature 466, 238-242, doi:nature09103 [pii]

10.1038/nature09103 (2010).

5 Skorecki, K. et al. Y chromosomes of Jewish priests. Nature 385, 32, doi:10.1038/385032a0 (1997).

6 Hammer, M. F. et al. Extended Y chromosome haplotypes resolve multiple and unique lineages of the Jewish priesthood. Hum. Genet. 126, 707-717, doi:10.1007/s00439-009-0727-5 (2009).

7 Tofanelli, S. et al. J1-M267 Y lineage marks climate-driven pre-historical human displacements. Eur. J. Hum. Genet. 17, 1520-1524, doi:10.1038/ejhg.2009.58 (2009).

8 Klyosov, A. A. A comment on the paper: Extended Y chromosome haplotypes resolve multiple and unique lineages of the Jewish Priesthood by M.F. Hammer, D.M. Behar, T.M. Karafet, F.L. Mendez, B. Hallmark, T. Erez, L.A. Zhivotovsky, S. Rosset, K. Skorecki, Hum Genet, published online 8 August 2009. Hum. Genet. 126, 719-724; author reply 725-716, doi:10.1007/s00439-009-0739-1 (2009).

Nanopore Updates from David Eccles

Reader David Eccles has been kind enough to keep us regularly posted on his progress with Oxford Nanopore sequencing. To make sure his helpful posts do not get lost in the comment section, we are creating a separate post as a collection of all his comments at different places in our blog. Readers are especially encouraged to see the last two comments, where he presented analysis of his own data and R codes for others to play with the data. David, please feel free to add additional comments to this thread so that others can follow them in order.

His first comment on 9/9/2014

As a participant of the MinION access programme (MAP), I have agreed to not report on technical information on the MinION sequencer (and specifically the control lambda DNA samples sent to us by Oxford Nanopore) until our laboratory has decided that the results we’re getting from the device on our own data are acceptable to us. That’s the main reason there’s such a dearth of information at the moment, and the main reason why it’s difficult for me to discuss the details of sequencing.

Invitations to the programme were given in February 2014 (i.e. this year), and the registration process (including paying money and validating that software was working okay) was complete for most people by May 2014. While we were initially told this would be a 6-week process, it has stretched out to be much longer than that, and it is only now that the second wave of participants are coming into the programme.

I expect that we were one of the last labs in the first wave of participants to receive our first sample kits, and we’ve had issues with sample preparation such that we haven’t yet had an acceptable run with our own data (mouse mitochondria). Our lab in New Zealand had shipping issues due to various reasons, and we weren’t able to do our first run until around mid July 2014.

Since then, there have been only a couple of dribbles of “data” from people who have self-certified and consider the MinION good enough for their uses. The most recent that I’m aware of is a video recording that Nick Loman made of a 2-hour sequencing run. You can see the final read length histogram here:

The output files from that run has yet to be released, but it is possible to see from the video the following statistics:

[numbers are estimated from the event data produced by the software, and the number of mappable bases will be less than this]

About 45 Mbp of sequence were generated over the course of the run. The device produced a little under one read per second, so the mean [estimated] read length is about 7kbp. To reduce confusion slightly, the histogram shown in the MinKNOW program here is adjusted so that the bin height displays the summed length of sequences in each bin, rather than the number of sequences. As an example of interpretation for this run, sequence data was mostly generated from sequences that were about 20kbp long.

I’m happy to answer questions about this, but be aware that I am a little bit restricted in that, and can only report on *technical* information that has been produced and release by other self-certified labs.

As an aside, what seems to be missed from much of this discussion is just how different this technology is. I consider it a disruptive technology, and as such, most of the applications for the MinION sequencer have yet to be discovered. Here’s my rough summary of the human-visible parts:

The MinION is a small disposable device about the size of a muesli bar (or granola bar) that plugs into the USB port of a computer, with power for sequencing and fluidics provided by the computer and your pipette. Once samples are loaded onto the device, it will produce a continuous stream of electrical sensor data in real-time that is processed on the computer to identify DNA bases that are moving through each of the 512 pores. Based on a couple of published/released graphs that I’ve seen (from Nick Loman’s lab), the sequencer is producing 10-25M sequence events per hour. These events are converted by base calling software (currently uploaded to and downloaded from a cloud service) into base-space sequence, resulting in a near real-time production of generated sequence.

Unlike other DNA sequencers, there is no DNA amplification required to run samples on the MinION. Sample preparation is reduced to mixing by inversion, spinning down liquids using a fist-sized microcentrifuge, pipetting, and room-temperature incubation to carry out DNA processing reactions — all stuff that could be done in an hour or so in the back of a Jeep.

Unlike other DNA sequencers, the MinION carries out sequencing by directly observing the DNA bases as they move through the detection pore; the sequencing process is observing the exact template strand of interest, rather than a synthesised copy. An electrical current is passed across the pore, and this changes depending on what type of base is moving through the pore at the time. The MinION produces a real-time trace of electrical signals for each pore, allowing the run to be stopped as soon as an interesting sequence is found, then re-loaded with a new sample and restarted.

Unlike other DNA sequencers, there is no limit to the length of sequences that are fed into the MinION [although sequences that are too long can get broken up by pipetting, so the recommended upper limit under standard use is about 50kb]. Instead of fragmenting DNA into hundreds or thousands of pieces prior to sequencing, much longer sequences can be used, allowing sequencing across regions of duplication or low complexity.


His second comment on the same day

> The technology has many similarities with Pacbio except that the readout here is electrical instead of optical and the sequencing instrument is tiny rather than huge.

I wouldn’t discount the actual sequencing process. To reiterate one of my previous points, “the MinION carries out sequencing by directly observing the DNA bases as they move through the detection pore.” PacBio uses fluorescently-tagged nucleotides for sequencing, and requires strand synthesis to occur in order to determine sequence — it’s still sequencing by synthesis. That means you can’t tell the precise nature of the template strand because you have to plan in advance what to expect. This is not a particularly interesting distinguishing feature at the moment, but I expect it to become more useful in the future as others determine how different base modifications (or different bases) influence the pore signal.

Ignoring that, The “signal” part of the technologies comes down to another optical vs electrical fight (like we’ve already seen with IonTorrent vs MiSeq). All optical methods (so far) are a bit more complicated because they require modified fluorescently-tagged (or radiolabelled) nucleotides in order to carry out sequencing.

> So, the obvious question would be how the signal quality of electrical readout is better than the optical readout that PacBio has.

This is, obviously, delving into technical details. At the moment all I feel comfortable with saying is that the quality of what Oxford Nanopore are able to produce internally is higher than what is being observed in the field. This is expected, and the whole point of the MAP — Oxford Nanopore want to work out what needs to be done in order to get others to see the kind of results that they are seeing in-house. If you’re interested in getting more detailed answers (or at least interested in asking similar questions), I suggest you attend or review the Google hangout that’s happening in about 12 hours (not wonderful timing for you, unfortunately):

Clive Brown is very aware of your post, and is likely to try to address at least some of the point you have raised in that hangout.


Third comment on 9/11/2014

Doing a kmer analysis is a good idea… I’ll have to add that to my standard workflow. However it doesn’t quite tell the full story about the error profile. Because the nanopore base calling is event/time based in an asynchronous fashion (like PacBio), it’s very prone to extra events and skipped events (i.e. INDELs).

Illumina has a much greater advantage here because they use reversible terminators for sequencing, which usually means that it’s one base sequenced per cycle for all spots on the flow cell. When the “usually” fails (i.e. failing to add a base, or adding too many bases), you get phasing errors and a drop in quality (hence the upper limit on sequencing length due to the accumulation of phasing errors).

Nanopore doesn’t have issues with length — the fifth base call is just as good (or bad) as the fifty thousandth, but you get less reliability because the template doesn’t progress through the pore at a constant speed. A combination of better base calling (i.e. software) and better motor enzymes (i.e. wetware) might reduce those errors in the future, but I expect they’ll never go away entirely.


On 9/10/2014

Okay, so now that we’re self-certified, here’s my take on it. Here’s a base call frequency plot:


Each base call at each location is plotted in a different colour, and the reference base is circled. It demonstrates what ~25% base-call error looks like when the alternate error is fairly random — errors for most non-reference bases are sitting at around the 5% level. There is a little deletion bias, but for SNPs the genotyping already works pretty well.

What I find very interesting is that almost all of this error is a software problem, and you can see quite clearly where the base-calling algorithms are lacking by generating raw trace data together with your run. In other words, the flow cell is great, the chemistry is great, the default sampling resolution is sufficient, but the base calling and event detection algorithms are a bit too immature to identify bases with high reliability.

I’m still cleaning up my plots for that, but I should have something to show in a week or so.

You may also be interested in these probability tables (from a different run than that pictured, but similar results):

Reference Base Probability [sequenced base is the left-most column]
A 0.95 0.022 0.014 0.014
C 0.055 0.877 0.021 0.048
G 0.097 0.069 0.743 0.09
T 0.02 0.035 0.016 0.929
i 0.366 0.239 0.116 0.279
d 0.342 0.287 0.124 0.247

Sequenced Base Probability [reference base is the left-most column]
A C G T i d
A 0.844 0.034 0.038 0.015 0.019 0.049
C 0.029 0.814 0.04 0.038 0.019 0.06
G 0.035 0.037 0.826 0.034 0.017 0.051
T 0.015 0.036 0.043 0.846 0.018 0.043


On 18/10/2014

While I’m still working out the best way to display the raw signal data, I’ve finally been able to bundle it up into something that should be suitable for graphing. Due to file sizes, I currently only hava a file containing the event data + signal data for the first 6 reads from one of our experiments, but it should be good enough to get a feel for the data:!zZAiWR7A!gF86n5f2YAHmaLsQTnMswA

The data is stored as an R data object (load using the ‘load’ function of R), and contains the following information:

* mtDNA reference sequence
* base-called sequences
* read type (template / complement / 2D)
* channel# / read# / file#
* event file name
* scaling information for adjusting between raw signal and event data
* event data
* raw signal ranges corresponding to the event data
* raw signal

I have excluded my mapping from base-called to reference sequence, because mapping is a very subjective thing.

ASHG/GA4GH Special – A Recap of Interesting NGS Algorithms of the Year

If you find our commentaries useful, please feel free to press the donate button on the right sidebar and your name will be included in the table at the bottom of the page. We thank Pawel Osipowski, a PhD student from Poland, for making today’s generous contribution to support our blog.


Section 1 covers algorithms for short reads (primarily Illumina). The reads are clean and the main problem is the volume of data. The size of each library can be 50-100G and imagine doing that for 10,000 human genomes together. The categories for challenging problems are – base-calling, kmer counting (the most basic step), de Bruijn graph construction, genome assembly, BWT construction of large library, compression of large libraries, searching through many such genomes and expression analysis.

Section 2 covers long noisy reads primarily from PacBio, but should also be equally applicable for nanopore data. The challenge here is how to take care of the noise in the reads.

Section 3 covers GPU acceleration, which is one of the ways to go over the data analysis bottleneck.

Section 4 asks the question of whether it is really necessary to keep scaling up from 10K genomes to 100K genomes and so on, or can we answer biological questions by trying a different route.

Section 5 mentions two new blogs started this year by skilled bioinformaticians.

Section 6 adds a number of interesting papers missed in the first cut. We went through all issues of arxiv/biorxiv and tried to find every interesting paper.

Enjoy and feel free to comment, distribute, criticize, flame us or whatever !!


1. Algorithms for Short Reads

KMC – Minimizer-based Kmer Counting Algorithm for Short Reads

KMC 2: Fast and resource-frugal k-mer counting

Motivation: Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.

Results: We present a novel method for k-mer counting, on large datasets at least twice faster than the strongest competitors (Jellyfish~2, KMC~1), using about 12\,GB (or less) of RAM memory. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k,x)-mers allows to significantly reduce the I/O, and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC~2 allows to count the 28-mers of a human reads collection with 44-fold coverage (106\,GB of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with an SSD.

We also discussed another efficient kmer counting program – scTurtle – in our blog.

BCALM – Chikhi et al.’s Algorithm for dBG Construction using Minimizer

De Novo Assembly of Human Genome with Only 1.5 GB RAM

The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers proposed a navigational data structure approach in order to improve memory usage. We prove several theoretical space lower bounds to show the limitation of these types of approaches. We further design and implement a general data structure (DBGFM) and demonstrate its use on a human whole-genome dataset, achieving space usage of 1.5 GB and a 46% improvement over previous approaches. As part of DBGFM, we develop the notion of frequency-based minimizers and show how it can be used to enumerate all maximal simple paths of the de Bruijn graph using only 43 MB of memory. Finally, we demonstrate that our approach can be integrated into an existing assembler by modifying the ABySS software to use DBGFM.

Kraken – Minimizer-based Algorithm for Metagenome Classification

Kraken: Ultrafast Metagenomic Sequence Classification Using Exact Alignments

Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at

BWT Construction for Large Short-read Library

Heng Li’s ropebwt2 paper.

Fast construction of FM-index for long sequence reads

We mentioned Heng Li’s ropebwt2 code at github in the past, but now he discloses the algorithm in a brief paper posted at arxiv. We will spend the rest of the day trying to understand what is going on.

Summary: We present a new method to incrementally construct the FM-index for both short and long sequence reads, up to the size of a genome. It is the first algorithm that can build the index while implicitly sorting the sequences in the reverse (complement) lexicographical order without a separate sorting step. The implementation is among the fastest for indexing short reads and the only one that practically works for reads of averaged kilobases in length. Availability and implementation: Contact:

Is constructing FM index still relevant in the ‘Minimizer’ world?

Rayan Chikhi worked through an example here that you will find helpful.

Also check –

BWT Construction Parallelized in ‘Parabwt’

BEETL-fastq: a searchable compressed archive for DNA reads

FASTQ is a standard file format for DNA sequencing data which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Text compression tools such as gzip are often employed to reduce the storage burden, but have the disadvantage that the data must be decompressed before it can be used.

Here we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip, but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input.
We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, from where we can search for 1, 10, 100, 1000, a million of 30-mers in respectively 3, 8, 14, 45 and 567 seconds plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and “in silico pull-down” experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization.
BEETL-fastq is part of the BEETL library, available as a github repository at

Compression of Large Libraries

BARCODE – Fast lossless compression via cascading Bloom filters

BMC Bioinformatics paper

Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.

We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.

Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested
compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.

Minimizer Success – Disk-based Genome Sequencing Data Compression

We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Most compressors for such data, usually in FASTQ format, cannot efficiently compress the DNA stream if a given dataset is large, since the redundancy between overlapping reads cannot be easily exploited in the main memory, often a few times smaller than the amount of data to compress. More interesting solutions for this problem are disk-based~\cite{Y2011,CBJR2012}, where the better of these two, from Cox~et al.~\cite{CBJR2012}, is based on the Burrows–Wheeler transform (BWT) and achieves 0.484\,bits per base for a 135.3\,Gbp human genome sequencing collection with 45-fold coverage. Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.376\,bits per base as the compression ratio.

Searching from Many Genomes

Check this development from David Haussler’s group.

David Haussler and colleagues had been working on how to align a new read library against thousands of human genomes. The easy solution of performing the same alignment thousand times does not scale well. Benedict Paten, Adam Novak and David Haussler posted a new paper in arxiv to address this question. This appears like an interesting paper, but we have not read it carefully to understand the innovation beyond what Juni Siren (corrected) and others proposed before.

To support comparative genomics, population genetics, and medical genetics, we propose that a reference genome should come with a scheme for mapping each base in any DNA string to a position in that reference genome. We refer to a collection of one or more reference genomes and a scheme for mapping to their positions as a reference structure. Here we describe the desirable properties of reference structures and give examples. To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.

Readers may also find the following work from Haussler group relevant –

HAL: a Hierarchical Format for Storing and Analyzing Multiple Genome Alignments

SPAdes for Genome Assembly

When it comes to genome assembly, SPAdes group stands way apart from others by publishing many new developments every year.

SPAdes Group’s Long-awaited Repeat Resolver Paper is Published

dipSpades Beats Haplomerger Hands Down in Diploid Assembly

Also, Pavel Pevzner, who leads this group, published an interesting paper on manifold de Bruijn graph. We are not sure whether it is incorporated in SPAdes yet.

Manifold de Bruijn Graphs – Yu Lin and Pavel A. Pevzner

Genome assembly is usually abstracted as the problem of reconstructing a string from a set of its k-mers. This abstraction naturally leads to the classical de Bruijn graph approach—the key algorithmic technique in genome assembly. While each vertex in this approach is labeled by a string of the fixed length k, the recent genome assembly studies suggest that it would be useful to generalize the notion of the de Bruijn graph to the case when vertices are labeled by strings of variable lengths. Ideally, we would like to choose larger values of k in high-coverage regions to reduce repeat collapsing and smaller values of k in the low-coverage regions to avoid fragmentation of the de Bruijn graph. To address this challenge, the iterative de Bruijn graph assembly (IDBA) approach allows one to increase k at each iterations of the graph construction. We introduce the Manifold de Bruijn (M-Bruijn) graph (that generalizes the concept of the de Bruijn graph) and show that it can provide benefits similar to the IDBA approach in a single iteration that considers the entire range of possible k-mer sizes rather than varies k from one iteration to another.

BESST for Scaffolding During Genome Assembly


BESST is a package for scaffolding genomic assemblies. It contains several modules for e.g. building a “contig graph” from available information, obtaining scaffolds from this graph, and accurate gap size information (based on GapEst For installation, see docs/INSTALL.txt.

Sailfish for RNAseq Expression Analysis

For RNAseq expression profiling, Sailfish reduced the time of analysis by doing it straight from kmers instead of reads.

Check –

Goodbye RSEM, Sailfish Paper Published

We wrote about very fast Sailfish algorithm, when it came out in the arxiv (DEseq and Sailfish Papers for RNAseq). Also, one of our readers liked the code and we wrote about it as well –

Good C++ Development Practices in Sailfish Code

Now the readers can find the published paper in Nature Biotech. There is no excuse for not using it.

Blindcall for Base-calling

BlindCall: Ultra-fast Base-calling Algorithm Using Blind Deconvolution

Motivation: Base-calling of sequencing data produced by highthroughput
sequencing platforms is a fundamental process in current bioinformatics analysis. However, existing third-party probabilistic or machine-learning methods that significantly improve the accuracy of base-calls on these platforms are impractical for production use due to their computational inefficiency.

Results: We directly formulate base-calling as a blind deconvolution problem and implemented BlindCall as an efficient solver to this inverse problem. BlindCall produced base-calls at accuracy comparable to state-of-the-art probabilistic methods while processing data at rates ten times faster in most cases. The computational complexity of BlindCall scales linearly with read length making it better suited for new long-read sequencing technologies.

Availability and Implementation: BlindCall is implemented as a set of Matlab scripts available for download at

2. Pacbio Reads

DALIGN – Pacbio Aligner

In DALIGN Paper, Gene Myers Delivers a Major Blow to His Biggest Competitor

It is a very neat paper, because Gene Myers showed here a way to speed up alignment algorithms without going through BWT construction. It stands out of the norm from most other NGS alignment algorithms.

MHAP – Adam Philippy

After HGAP And SPAdes Comes New PacBio Assembler – MHAP

Hybrid Assembly

Very Efficient Hybrid Assembler for PacBio Data

Reader Chengxi Ye mentioned to us a few months back that he developed a very efficient genome assembler for PacBio data, which did PacBio assembly of human genome in 6 hours on a desktop computer compared to >400K CPU hours using a cluster. For those, who do not know Chengxi Ye, he was the author of SparseAssembler and also developed ultrafast basecaller BlindCall. The algorithm of SparseAssembler is more relevant in the current context, because he leveraged its compression scheme in the new PacBio/hybrid assembler.

3. Hardware Acceleration – GPU

This year, Professor T.-W. Lam's group at HKU published a number of papers on hardware acceleration of bioinformatics algorithms using GPUs. Although the tools are mostly for short reads, we give them a separate category here.

BWT Construction using GPU

This new preprint from Chi-Man Liu, Ruibang Luo and Tak-Wah Lam is worth checking out (h/t: @lexnederbragt)!! BWT construction is a major problem for large NGS read files, and as we discussed in “What Problems are Being Solved by the Top NGS Bioinformaticians Today?”, almost every top bioinformatician has a BWT-construction project going on. There are several benefits to turning read files into a BWT quickly.

(i) faster compression and easy transfer (Tony Cox’s rearranging of reads is a good example),

(ii) quick search through a large set of reads,

(iii) k-mer counting is an equivalent problem to BWT construction (see – “BWT Construction and K-mer Counting”),

(iv) genome assembly using SGA-type algorithms.
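As a refresher, here is the textbook construction of a BWT from sorted cyclic rotations, in a few lines of Python. This is a toy, of course: the papers above are precisely about avoiding this naive quadratic sort at terabase scale.

```python
def bwt(text):
    """Naive Burrows-Wheeler Transform via sorted cyclic rotations.

    Real NGS tools build the BWT from a suffix array or by merging
    blocks, since materializing and sorting all rotations is hopeless
    for billions of reads.
    """
    s = text + "$"   # unique terminator, lexicographically smallest
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("BANANA"))     # ANNB$AA
print(bwt("ACGTACGT"))   # TT$AACCGG
```

The runs of identical characters in the output are what make benefit (i) work, and the sorted-suffix structure underneath is the same one exploited for benefits (ii)-(iv).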

BALSA for Variant Calling using GPU

This paper reports an integrated solution, called BALSA, for the secondary analysis of next-generation sequencing data; it exploits the computational power of the GPU and intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 hours to process 50-fold whole genome sequencing (~750 million 100bp paired-end reads), or just 25 minutes for 210-fold whole exome sequencing. BALSA’s speed is rooted in its parallel algorithms, which effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and indels and achieves competitive variant-calling accuracy and sensitivity when compared to an ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and is more sensitive for somatic indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM for variant calling, enables efficient storage and indexing, and facilitates the development of apps for downstream analyses.

MEGAHIT – Metagenome Assembly using GPU

MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset of 252 Gbp in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it avoids pre-processing like partitioning and normalization, which might compromise the integrity of the result. MEGAHIT generates an assembly 3 times larger, with longer contig N50 and average contig length, than the previous assembly; 55.8% of the reads were aligned to the assembly, 4 times higher than for the previous one. The source code of MEGAHIT is freely available at this https URL under the GPLv3 license.


4. Future

What should one do with so many efficient algorithms? The conventional answer is to cure human diseases, but we believe the companies are over-investing in this space. Our proposal is presented in the following commentaries.

Bioinformatics at a Crossroad Again – Which Way Next?

Crossroads (ii) – Is It Too Late to Acknowledge that Systems Biology and GWAS Failed?

Crossroads (iii) – a New Direction for Bioinformatics with Twelve Fundamental Problems

To spare you from reading three long posts, here is a summary. We believe the new algorithms will be useful in solving fundamental problems in biology, and that process will eventually lead to solutions for health and other practical problems. The current attempts are too superficial.

To solve fundamental problems, biologists and computer scientists need to meet halfway and learn quite a bit about each other’s field. That is the direction in which we would like to take our blog from next year. All other suggestions are welcome.

5. New blogs

A number of new bioinformatics blogs came online in 2014. Here are a few noteworthy ones.

Gene Myers’ Dazzler Blog

James Knight’s Blog at Yale

If you think we missed any noteworthy contribution (including yours), please feel free to mention in the comment section.


Edit. 6. New Additions

After completing this commentary, we went to bioRxiv to see whether we had missed any other interesting developments. The following papers appear noteworthy. We cannot guarantee that all of them are excellent, and welcome readers’ comments to spare us from reading so many papers.

Lighter: fast and memory-efficient error correction without counting

It is from the same group that developed Jellyfish and Sailfish.

Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
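The two-filter trick is easy to sketch. Below is a minimal Bloom filter of our own (SHA-256 as a stand-in hash; Lighter's real implementation is far more careful about sizing and hashing), showing the membership mechanics behind a "trusted k-mer" filter.

```python
import hashlib

class Bloom:
    """Minimal Bloom filter: k hash functions over a fixed-size bit array."""

    def __init__(self, size=1 << 16, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one cryptographic hash.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

# Lighter's scheme, schematically: one filter holds a subsample of input
# k-mers; k-mers supported by enough subsampled neighbors are promoted to a
# second, "trusted" filter. Here we just show the membership mechanics.
trusted = Bloom()
for kmer in ["ACGTA", "CGTAC", "GTACG"]:   # k-mers voted "correct"
    trusted.add(kmer)

print("ACGTA" in trusted, "TTTTT" in trusted)
```

The point of the design is that a Bloom filter never stores the k-mers themselves, so memory stays fixed no matter the sequencing depth, at the cost of a small false-positive rate.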

Compression of short-read sequences using path encoding

Another one from the same group.

Storing, transmitting, and archiving the amount of data produced by next generation sequencing is becoming a significant computational burden. For example, large-scale RNA-seq meta-analyses may now routinely process tens of terabytes of sequence. We present here an approach to biological sequence compression that reduces the difficulty associated with managing the data produced by large-scale transcriptome sequencing. Our approach offers a new direction by sitting between pure reference-based compression and reference-free compression, and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs, a common task in genome assembly, and context-dependent arithmetic coding. Supporting this method is a system, called a bit tree, to compactly store sets of k-mers that is of independent interest. Using these techniques, we are able to encode RNA-seq reads using 3%–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than recent competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.

Faster sequence alignment through GPU-accelerated restriction of the seed-and-extend search space

Salzberg and colleagues going into GPU space.

Motivation: In computing pairwise alignments of biological sequences, software implementations employ a variety of heuristics that decrease the computational effort involved in computing potential alignments. A key element in achieving high processing throughput is to identify and prioritize potential alignments where high-scoring mappings can be expected. These tasks involve list-processing operations that can be efficiently performed on GPU hardware. Results: We implemented a read aligner called A21 that exploits GPU-based parallel sort and reduction techniques to restrict the number of locations where potential alignments may be found. When compared with other high-throughput aligners, this approach finds more high-scoring mappings without sacrificing speed or accuracy. A21 running on a single GPU is about 10 times faster than comparable CPU-based tools; it is also faster and more sensitive in comparison with other recent GPU-based aligners.

Algorithms in Stringomics (I): Pattern-Matching against “Stringomes”

Paolo Ferragina usually does innovative work. Worth taking a look.

This paper reports an initial design of new data structures that generalize the idea of pattern-matching in stringology, from its traditional usage in an (unstructured) set of strings to the arena of a well-structured family of strings. In particular, the object of interest is a family of strings composed of blocks/classes of highly similar “stringlets,” thus mimicking a population of genomes made by concatenating haplotype blocks, further constrained by haplotype phasing. Such a family of strings, which we dub “stringomes,” is formalized in terms of a multi-partite directed acyclic graph with a source and a sink. The most interesting property of stringomes is probably the fact that they can be represented efficiently with compression up to their k-th order empirical entropy, while ensuring that the compression does not hinder the pattern-matching counting and reporting queries – either internal to a block or spanning two (or a constant number of) adjacent blocks. The solutions proposed here have immediate applications to next-generation sequencing technologies, base-calling, expression profiling, variant-calling, population studies, onco-genomics, cyber security trace analysis and text retrieval.

Using 2k + 2 bubble searches to find SNPs in k-mer graphs

Single Nucleotide Polymorphism (SNP) discovery is an important preliminary for understanding genetic variation. With current sequencing methods we can sample genomes comprehensively. SNPs are found by aligning sequence reads against longer assembled references. De Bruijn graphs are efficient data structures that can deal with the vast amount of data from modern technologies. Recent work has shown that the topology of these graphs captures enough information to allow the detection and characterisation of genetic variants, offering an alternative to alignment-based methods. Such methods rely on depth-first walks of the graph to identify closing bifurcations. These methods are conservative or generate many false-positive results, particularly when traversing highly inter-connected (complex) regions of the graph or in regions of very high coverage. We devised an algorithm that calls SNPs in converted De Bruijn graphs by enumerating 2k + 2 cycles. We evaluated the accuracy of predicted SNPs by comparison with SNP lists from alignment based methods. We tested accuracy of the SNP calling using sequence data from sixteen ecotypes of Arabidopsis thaliana and found that accuracy was high. We found that SNP calling was even across the genome and genomic feature types. Using sequence based attributes of the graph to train a decision tree allowed us to increase accuracy of SNP calls further. Together these results indicate that our algorithm is capable of finding SNPs accurately in complex sub-graphs and potentially comprehensively from whole genome graphs. The source code for a C++ implementation of our algorithm is available under the GNU Public Licence v3 at:
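The geometry is easy to see on a toy example: a SNP gives each allele k private k-mers, i.e. two parallel arms of k nodes between shared flanking k-mers, which close into a cycle of 2k + 2 nodes. Below is our own illustration of that counting argument, not the authors' code.

```python
def kmers(seq, k):
    """All substrings of length k: the nodes of a k-mer (de Bruijn) graph."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 4
ref = "AACGTGCTTGAC"
alt = "AACGTCCTTGAC"   # same sequence with a single SNP (G -> C at position 5)

shared = kmers(ref, k) & kmers(alt, k)
arms = kmers(ref, k) ^ kmers(alt, k)   # k-mers private to one allele

# Each allele contributes k k-mers spanning the SNP, so the two "arms" of
# the bubble hold 2k nodes; with the shared branching k-mer on either side
# the bubble closes into a cycle of 2k + 2 nodes, which is exactly what
# the paper's algorithm enumerates.
print(len(arms))   # 2k = 8
```

Enumerating cycles of exactly this length is what lets the method call SNPs without first assembling or aligning anything.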

KmerStream: Streaming algorithms for k-mer abundance estimation

Paul Melsted.

Motivation: Several applications in bioinformatics, such as genome assemblers and error-correction methods, rely on counting and keeping track of k-mers (substrings of length k). Histograms of k-mer frequencies can give valuable insight into the underlying distribution and indicate the error rate and genome size sampled in the sequencing experiment. Results: We present KmerStream, a streaming algorithm for computing statistics for high-throughput sequencing data based on the frequency of k-mers. The algorithm runs in time linear in the size of the input, and the space requirement is logarithmic in the size of the input. This very low space requirement allows us to deal with much larger datasets than previously presented algorithms. We derive a simple model that allows us to estimate the error rate of the sequencing experiment, as well as the genome size, using only the aggregate statistics reported by KmerStream, and validate the accuracy on sequences from a PhiX control. As an application we show how KmerStream can be used to compute the error rate of a DNA sequencing experiment. We run KmerStream on a set of 2656 whole-genome sequenced individuals and compare the error rate to quality values reported by the sequencing equipment. We discover that while the quality values alone are largely reliable as a predictor of error rate, there is considerable variability in the error rates between sequencing runs, even when accounting for reported quality values. Availability: The tool KmerStream is written in C++ and is released under a GPL license. It is freely available at
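To see what "streaming" buys here, consider the classic k-minimum-values sketch, which estimates the number of distinct k-mers in one pass with constant memory. This is our own illustrative sketch of the general idea, not KmerStream's algorithm, and CRC32 is just a stand-in hash.

```python
import heapq
import random
import zlib

def estimate_distinct(items, s=200):
    """KMV ("k minimum values") sketch: estimate distinct items in one pass.

    Keep the s smallest distinct hash values seen. If the s-th smallest,
    as a fraction h of the hash range, roughly s items landed in a
    fraction h of the range, so the distinct count is about (s - 1) / h.
    Memory stays fixed no matter how long the stream is.
    """
    M = float(2 ** 32)
    heap = []          # max-heap (by negation) of the s smallest hashes
    members = set()    # the same hashes, for duplicate detection
    for item in items:
        h = zlib.crc32(item.encode())
        if h in members:
            continue
        if len(heap) < s:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            members.discard(-heapq.heappushpop(heap, -h))
            members.add(h)
    if len(heap) < s:
        return float(len(heap))      # stream had fewer than s distinct hashes
    return (s - 1) * M / -heap[0]    # -heap[0] is the s-th smallest hash

# Stream of 21-mers from a random 5 kb "genome": ~4980 distinct k-mers.
random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(5000))
estimate = estimate_distinct(genome[i:i + 21] for i in range(len(genome) - 20))
print(round(estimate))
```

The estimate is within a few percent of the truth with s = 200, and the error shrinks as 1/sqrt(s); KmerStream layers more refined sketches on this idea to get the full abundance histogram.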

SplitMEM: Graphical pan-genome analysis with suffix skips

Michael Schatz.

Motivation: With the rise of improved sequencing technologies, genomics is expanding from a single-reference-per-species paradigm into a more comprehensive pan-genome approach, with multiple individuals represented and analyzed together. One of the most sophisticated data structures for representing an entire population of genomes is a compressed de Bruijn graph. The graph structure can robustly represent everything from simple SNPs to complex structural variations, far beyond what can be done from linear sequences alone. As such, there is a strong need to develop algorithms that can efficiently construct and analyze these graphs. Results: In this paper we explore the deep topological relationships between the suffix tree and the compressed de Bruijn graph. We introduce a novel O(n log n) time and space algorithm called splitMEM, which directly constructs the compressed de Bruijn graph for a pan-genome of total length n. To achieve this time complexity, we augment the suffix tree with suffix skips, a new construct that allows us to traverse several suffix links in constant time, and use them to efficiently decompose maximal exact matches (MEMs) into the graph nodes. We demonstrate the utility of splitMEM by analyzing the pan-genomes of 9 strains of Bacillus anthracis and 9 strains of Escherichia coli to reveal the properties of their core genomes. Availability: The source code and documentation are available open-source at

MUSIC: A Hybrid Computing Environment for Burrows-Wheeler Alignment for Massive Amount of Short Read Sequence Data

High-throughput DNA sequencers are becoming indispensable in our understanding of diseases at the molecular level, in marker-assisted selection in agriculture and in microbial genetics research. These sequencing instruments produce enormous amounts of data (often terabytes of raw data in a month) that require efficient analysis, management and interpretation. The commonly used sequencing instruments today produce billions of short reads (up to 150 bases) per run. The first step in the data analysis is alignment of these short reads to the reference genome of choice. There are different open-source algorithms available for sequence alignment to the reference genome. These tools normally have a high computational overhead, both in terms of number of processors and memory. Here, we propose a hybrid-computing environment called MUSIC (Mapping USIng hybrid Computing) for one of the most popular open-source sequence alignment algorithms, BWA, using accelerators that show significant improvement in speed over the serial code.

Alignment-free comparison of next-generation sequencing data using compression-based distance measures

Enormous volumes of short-read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short-read assembly. Thus alignment-free methods need to be developed for the comparison of NGS samples of short reads. Recently, new k-mer based distance measures such as CVTree, d2S and co-phylog have been proposed to address this problem. However, those distances depend considerably on the parameter k, and how to choose the optimal k is not trivial, since it may depend on different aspects of the sequence data. Hence, in this paper we consider an alternative parameter-free approach: compression-based distance measures. These measures have shown impressive performance on long genome sequences in previous studies, but they have not been tested on NGS short reads. In this study we perform extensive validation and show that the compression-based distances are highly consistent with the distances obtained from the k-mer based methods, from the alignment-based approach, and from existing benchmarks in the literature. Moreover, as these measures are parameter-free, no optimization is required, and they still perform consistently well on multiple types of sequence data, for different kinds of species and taxonomy levels. The compression-based distance measures are assembly-free, alignment-free and parameter-free, and thus represent useful tools for the comparison of long genome sequences and NGS samples of short reads.
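The best-known compression-based measure is the normalized compression distance (NCD), which any general-purpose compressor can approximate. Here is a schematic with zlib, our own toy rather than the paper's exact measure.

```python
import random
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).

    If x and y share structure, compressing their concatenation costs
    little more than compressing the larger one alone, so the distance
    is small; for unrelated inputs it approaches 1.
    """
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

rng = random.Random(0)
sample_a = b"ACGTACGGTTCA" * 50                        # repetitive toy "genome"
sample_b = b"ACGTACGGTTCA" * 48 + b"ACGTACGATTCA" * 2  # same, with a few variants
unrelated = bytes(rng.choice(b"ACGT") for _ in range(600))

print(ncd(sample_a, sample_b), ncd(sample_a, unrelated))
```

No k, no alignment, no assembly: the compressor's model stands in for all of them, which is exactly the appeal on short-read samples.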

Linear Recognition of Almost Interval Graphs

Yixin Cao

Let interval+kv, interval+ke, and interval−ke denote the classes of graphs that can be obtained from some interval graph by adding k vertices, adding k edges, and deleting k edges, respectively. When k is small, these graph classes are called almost interval graphs. They are well motivated by computational biology, where the data ought to be representable by an interval graph, while an almost interval graph is the best one can expect. For any fixed k, we give linear-time algorithms for recognizing all these classes, and in the case of membership, our algorithms also provide a specific interval graph as evidence. When k is part of the input, these problems are known as graph modification problems, all NP-complete. Our results imply that they are fixed-parameter tractable parameterized by k, thereby resolving the long-standing open problem on the parameterized complexity of recognizing interval+ke, first asked by Bodlaender et al. [Bioinformatics, 11:49–57, 1995]. Moreover, our algorithms for recognizing interval+kv and interval−ke run in times O(6^k·(n+m)) and O(8^k·(n+m)) (where n and m stand for the numbers of vertices and edges, respectively, in the input graph), significantly improving the O(k^{2k}·n^3·m)-time algorithm of Heggernes et al. [STOC 2007] and the O(10^k·n^9)-time algorithm of Cao and Marx [SODA 2014], respectively.

Indexing large genome collections on a PC

Agnieszka Danek, Sebastian Deorowicz, Szymon Grabowski

Motivation: The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements.
Results: We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in the exact and approximate matching model, against a collection of thousand(s) of genomes. Its unique feature is the small index size, fitting in a standard computer with 16–32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, exact matching queries are handled in an average time of 39 μs, and with up to 3 mismatches in 373 μs, on the test PC with an index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 μs and 917 μs.
Availability: Software and Supplementary material: this http URL.

SAMBLASTER: fast duplicate marking and structural variant read extraction

Gregory G. Faust, Ira M. Hall

Motivation: Illumina DNA sequencing is now the predominant source of raw genomic data, and data volumes are growing rapidly. Bioinformatic analysis pipelines are having trouble keeping pace. A common bottleneck in such pipelines is the requirement to read, write, sort and compress large BAM files multiple times.
Results: We present SAMBLASTER, a tool that reduces the number of times such costly operations are performed. SAMBLASTER is designed to mark duplicates in read-sorted SAM files as a piped post-pass on DNA aligner output before it is compressed to BAM. In addition, it can simultaneously output into separate files the discordant read-pairs and/or split-read mappings used for structural variant calling. As an alignment post-pass, its own runtime overhead is negligible, while dramatically reducing overall pipeline complexity and runtime. As a stand-alone duplicate marking tool, it performs significantly better than PICARD or SAMBAMBA in terms of both speed and memory usage, while achieving nearly identical results.
Availability: SAMBLASTER is open source C++ code and freely available from this https URL
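The key observation is that on name-sorted aligner output both mates of a pair arrive together, so duplicate pairs can be flagged in a single pass with one hash table keyed on the pair's mapped coordinates. Below is our own toy model of that idea; SAMBLASTER itself keys on clipping-adjusted 5' positions and handles orphans, split reads and much more.

```python
def mark_duplicates(pairs):
    """Single-pass duplicate marking on name-sorted read-pairs, schematically.

    Each record is (name, chrom1, pos1, strand1, chrom2, pos2, strand2):
    both mates' mapped coordinates, available together because the input
    is in read-name order, as in raw aligner output. Yields (name, is_dup).
    """
    seen = set()
    for name, c1, p1, s1, c2, p2, s2 in pairs:
        # Canonically order the two ends so mate order does not matter.
        signature = tuple(sorted([(c1, p1, s1), (c2, p2, s2)]))
        yield name, signature in seen
        seen.add(signature)

reads = [
    ("r1", "chr1", 100, "+", "chr1", 350, "-"),
    ("r2", "chr1", 100, "+", "chr1", 350, "-"),  # PCR duplicate of r1
    ("r3", "chr2", 500, "+", "chr2", 720, "-"),
]
flags = list(mark_duplicates(reads))
print(flags)
```

Because no coordinate sort is needed, this can sit as a pipe between the aligner and BAM compression, which is where the pipeline savings come from.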

Constructing String Graphs in External Memory

In this paper we present an efficient external memory algorithm to compute the string graph from a collection of reads, which is a fundamental data representation used for sequence assembly.
Our algorithm builds upon some recent results on lightweight Burrows-Wheeler Transform (BWT) and Longest Common Prefix (LCP) construction providing, as a by-product, an efficient procedure to extend intervals of the BWT that could be of independent interest.
We have implemented our algorithm and compared its efficiency against SGA, the most advanced assembly string graph construction program.

Crossroads (iii) – a New Direction for Bioinformatics with Twelve Fundamental Problems

This is the last post of a series of three commentaries. Previous ones can be seen at the following links.

Bioinformatics at a Crossroad Again – Which Way Next?

Crossroads (ii) – Is It Too Late to Acknowledge that Systems Biology and GWAS Failed?

The series started when one of our readers asked about the remaining hard problems in bioinformatics. This reader is an expert in genome assembly, who contributed to the development of a number of cutting-edge algorithms. The conventional answer would have been to suggest working on transcriptome assembly or metagenome assembly or developing a more user-friendly version of Galaxy, or all of the above for big data, BIGGER data, BIGGEST DATA and never-ending iterations thereof.

However, we refrained from giving such an answer, because around the time the question came, we finished our electric fish paper and felt that a large part of the contemporary bioinformatics community follows a false paradigm guided by unproductive motivations. We explained what we meant by ‘false paradigm’ in previous commentaries and will elaborate further here.

We arrived at the above conclusion after closely following many bioinformatics papers, conference presentations, blog posts and discussions over the last few years. In our opinion, this false paradigm creates a disconnect, and conflicts, between traditional biologists and those who are trying to analyze large genomic data sets. Examples of such conflicts can be seen in unflattering remarks at various places on ENCODE, GWAS, eQTL and other expensive misadventures.


Entry of Computer Scientists into Biology

The influx of computer scientists into biology started in the late 1990s, when the human genome project neared its end. Proliferation of inexpensive sequencing technology and other associated large-scale methods (microarray, ChIP-chip, etc.) created and continues to create such a need. We often speak with many of those researchers over email, and find them to be very smart and trying to bridge the gap between two distant fields.

The contemporary era has many similarities with the late 1940s, when several physicists came into biology and brought in their expertise in X-ray crystallography and electron microscopy. Within a short period, that group established itself as the leader of the biological sciences. The physicists associated with Max Perutz in the Cavendish lab and later the ‘Laboratory of Molecular Biology’ made major breakthroughs in the biological sciences. While reviewing the history of this era, we also noticed that their work brought together previously unconnected fields – biochemistry, genetics and evolutionary biology.

In contrast, the scientific contributions of various expensive bioinformatics-heavy projects of the current era have been marginal. ENCODE and modENCODE now take pride in their ability to collect data, but data collection was the trivial part of their original plan. Many expensive GWAS studies have been complete failures, explaining only 2-3% of the variation in complex traits. It is unfortunate that this failure was predicted by Ken Weiss in 2000, based on his previous experience, but promoters of GWAS still do not pay attention to his arguments. Instead, to hide the failure, a recent GWAS study used statistical tricks and claimed to explain over 60% of the variation in the height of all human beings. Ken Weiss has taken their argument to the woodshed in three excellent blog posts.

The height of folly: are the causes of stature numerous but at least ‘finite’?

Hiding behind technicalities to avoid having to change

Morgan’s insight–and Morgan’s restraint



Very Efficient Hybrid Assembler for PacBio Data

Reader Chengxi Ye mentioned to us a few months back that he had developed a very efficient genome assembler for PacBio data, which completed a PacBio assembly of the human genome in 6 hours on a desktop computer, compared with >400K CPU hours on a cluster. For those who do not know Chengxi Ye, he is the author of SparseAssembler and also developed the ultrafast base-caller BlindCall. The algorithm of SparseAssembler is the more relevant one in the current context, because he leveraged its compression scheme in the new PacBio/hybrid assembler.

The paper is posted on arXiv and the programs can be downloaded from here.

DBG2OLC: Efficient Assembly of Large Genomes Using the Compressed Overlap Graph

The computational problem of genome assembly is preventing the transition from the prevalent second-generation to third-generation sequencing technology. The problem emerged because the erroneous long reads make the assembly pipeline prohibitively expensive in terms of computational time and memory consumption.

In this paper, we propose and demonstrate a novel algorithm that allows efficient assembly of long erroneous reads of mammalian-size genomes on a desktop PC. Our algorithm converts the de novo genome assembly problem from the de Bruijn graph to the overlap-layout-consensus framework. We only need to focus on the overlaps composed of reads that are not contained within any contigs built with the de Bruijn graph algorithm, rather than on all the overlaps in the genome data sets. For each read spanning several contigs, we compress the regions that lie inside a de Bruijn graph contig, which greatly lowers the length of the reads and therefore the complexity of the assembly problem. The new algorithm transforms previously prohibitive tasks such as pair-wise alignment into jobs that can be completed within a small amount of time. A compressed overlap graph that preserves all necessary information is constructed from the compressed reads to enable the final-stage assembly.

We implement the new algorithm in a proof-of-concept software package, DBG2OLC. Experiments with sequencing data from third-generation technologies show that our method is able to assemble large genomes much more efficiently than existing methods. On a large PacBio human genome dataset, we calculated the pair-wise alignment of 54x erroneous long reads of a human genome in 6 hours on a desktop computer, compared to the 405,000 CPU hours on a cluster previously reported by Pacific Biosciences. The final assembly results were of comparably high quality.
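The compression step is simple to picture: once a read's k-mers are anchored to de Bruijn graph contigs, the read collapses to the short list of contigs it threads through. Below is our own toy rendering of that idea, with made-up contig IDs, not the DBG2OLC code.

```python
def compress_read(read, k, kmer_to_contig):
    """Collapse a long read into the sequence of contigs it passes through.

    `kmer_to_contig` maps each k-mer to the ID of the de Bruijn contig
    containing it; consecutive k-mers from the same contig collapse to
    one symbol, shrinking the read (and every later pairwise alignment)
    dramatically. K-mers hit by sequencing errors simply map to nothing.
    """
    path = []
    for i in range(len(read) - k + 1):
        contig = kmer_to_contig.get(read[i:i + k])
        if contig is not None and (not path or path[-1] != contig):
            path.append(contig)
    return path

# Hypothetical anchoring: k-mers of two contigs, C1 and C2.
k = 4
contigs = {"C1": "AACGTGCT", "C2": "TTGACCAG"}
kmer_to_contig = {c[i:i + k]: name
                  for name, c in contigs.items()
                  for i in range(len(c) - k + 1)}

# A long, error-containing read threading C1 then C2 compresses to 2 symbols.
read = "AACGTGCTXTTGACCAG"   # 'X' marks an error no k-mer survives
print(compress_read(read, k, kmer_to_contig))
```

Aligning reads as short contig-ID strings instead of raw bases is what turns the prohibitive all-vs-all overlap step into something a desktop can finish.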

Latest from Steven McKnight on Harvard/Broadster Citation Culture

A number of badly-trained technicians complained on Twitter (e.g. check here) about a biochemist from UT Southwestern and compared him with Dan Graur. That is one step below comparing someone with Attila the Hun or Genghis Khan, and we had to check what this biochemist is up to.

Based on what we found, we believe this guy is the real deal!! We agree with much of what he says and recommend that you listen to him as well. Here are the last three commentaries from Steven McKnight.

Cite me

Starting with the latter of these problems, I give a humorous example of how things have gone haywire. Several years ago my colleagues Jian Wang, Peter Alexander and I reported our discovery that mouse embryonic stem cells consume threonine as a hydrocarbon fuel. Mouse ES cells grow at an incredibly rapid clip. Their mitotic doubling time is only four to five hours in duration, exceeding that of the most rapidly growing cancer cells and approaching the doubling time of laboratory strains of yeast. Hypothesizing that this might reflect the possibility that ES cells exist in an unusual metabolic state, we measured the levels of scores of metabolites as a function of ES cell differentiation. When cued to differentiate upon withdrawal of leukemia inhibitory factor and administration of retinoic acid, the growth rate of ES cells slows from a four- to five-hour doubling time to 24 hours.

By observing profound changes in the levels of metabolites associated with one carbon metabolism, we stumbled over the fact that pluripotent ES cells express the threonine dehydrogenase, or TDH, enzyme at a 1,000-fold higher level than any mouse tissue or cell line that we tested, and they use the enzyme to consume threonine as a metabolic fuel. Depletion of threonine from the culture medium killed mouse ES cells, and specific inhibitors of the TDH enzyme kill mouse ES cells while having no effect on any other cultured cells tested to date. We recognized this to be analogous to the metabolic state of rapidly growing bacterial and yeast cells, and we made reference to microbiologists upon whose shoulders we stood.

Three years later, the Harvard University labs of Lewis Cantley and George Daley published a nice follow-up paper in the journal Science. My colleagues and I were delighted to see the replication and extension of our earlier work, yet I was mortified to see that the Harvard manuscript made reference to a paper that I had entirely missed. Yes, the Cantley/Daley paper did cite our paper, yet it also attributed the discovery of threonine dependence of mouse ES cells to a paper from the laboratory of Eric Lander of the Broad Institute. The latter paper was published a full year before our 2009 Science paper.

I immediately downloaded the Broad paper and scoured it from end to end. To my confusion, I could not find the word threonine in the entire paper, much less threonine dehydrogenase or anything to do with the metabolic state of mouse ES cells. Inquiries to Cantley and Daley led me to a postdoctoral fellow who instructed me to view the supplemental data included in the Broad paper. Sure enough, the study listed thousands of genes selectively expressed in undifferentiated ES cells, and the gene encoding threonine dehydrogenase was embedded within the list. This was an “Alice in Wonderland” moment for me, showing that an entirely new era of attribution had evolved. Any discovery that stems from any gene listed in the supplemental data of the Broad manuscript can now be attributed to that paper! Instead of thinking “Are you kidding me?”, I recognize that we are now adapting to the new reality: Citations are no longer your father’s Oldsmobile.

This is very similar to our experience with another Broadster working closely with Lander.

The curse of committees and clubs

Now, to the second of two evils: the evolution of scientific clubs. Back when we used to “walk miles to school,” the scientific meetings we attended had some level of breadth. Among all meetings I attended back in the 1970s and 1980s, the two best were the Gordon Conference on Biological Regulatory Mechanisms and the Arolla Workshop. Both were relatively small meetings, including only perhaps 100 to 150 participants. Despite the small size of these meetings, both sported an intellectually thrilling breadth of scientific scope. One might hear about mobile genetic elements in maize, mating type switching in yeast (where sirtuin proteins came from), UV-mediated release of phage lambda from its lysogenic state, or the genetics of pattern formation in fruit fly embryos. Methods of genetics, biochemistry and molecular biology were applied to zoo- or botanical garden-like distributions of animals, microbes and plants. When one left such meetings, horizons of perspective were broadened. Boy, were those meetings fun.

Fast-forward 30 years, and what do we now have? The typical modern biomedical meeting spends a week on a ridiculously thin slice of biology. There are entire meetings devoted to the hypoxic response pathway, sirtuin proteins, p53, mTOR or NF-κB. If a scientist studies some aspect of any of these domains, he or she absolutely has to attend these mindless meetings where, at most, some minuscule increment of advancement is all there is to be learned.

Damn the fool who does not attend these meetings: The consequence is failure to maintain club membership. And why is club membership of such vital importance? Yes, precisely, there is nearly a one-to-one correspondence between these clubs and CSR study sections. To think that a grant applicant would have even a prayer of winning a fundable score from a study section wherein the applicant is not a club member is to be equated with idiocy.

Whether clubs came from committees or vice versa matters not – that is where the evolution of our biomedical enterprise has taken us. Upon closing out his presidency in 1961, Dwight Eisenhower offered the cautionary statement, “Beware of the military-industrial complex.” I close with a similar warning: Beware of the biomedical industrial complex. In subsequent essays, I will offer ideas on how we might reverse untoward trends.

and the best of all –

Down but not out

Here is the idea on big science. Once we gather enough of it, really smart people will be able to extract all of the diamonds of biology. Magically, for example, they will be able to use big data to predict correctly that cells have an enzyme that senses intracellular DNA and then triggers the production of cyclic GMP-AMP (cGAMP), which then activates the STING protein to mount an innate immune response. When this happens, we won’t need the biochemical skills of James Chen, who painstakingly discovered the aforementioned pathway.

Had we had access to big data science 30 years ago, we would not have needed Tom Cech’s chemical and biochemical acumen to discover catalytic RNA. Geez, would life have been simpler! The magic claw of big data science could have seen all of the discoveries of significance and simply plucked them out of the pile of massive data sets.

OK, enough foolishness. There is a place for everything, including big data science. The point I seek to make in this inaugural essay is the simple prediction that, as we peer into the looking glass of the future, Chen- and Cech-like discoveries abound. Not being a gambler, I will not short the stock of big data in anticipation of the bursting of its bubble. On the other hand, given that the market cap of mechanistic biochemistry may be at an all-time low, I could not be more bullish on our stock.

Time will tell whether big data science is just a Ponzi scheme or will instead dazzle us with magnificent discoveries. If it does, the reductionist, mechanistic approaches now out of fashion may fade into extinction. I trust that readers will see where my money is: We biochemists are down but not out.