Y
Y

## The GWAS Era is Over – Jay Shendure

Renowned geneticist Jay Shendure wrote an editorial titled ‘life after genetics’ (h/t: gholson lyon), but based on the content, it should have been titled ‘life after GWAS’. That means the GWAS era is over, or should be over soon according to Shendure.

The article is very similar to various commentaries in our blog, but presented in polite language. For example, instead of calling various large GWAS studies as abysmal failure, Shendure called them ‘successes’, yet asked for ‘shift its attention from disease gene identification to following through on next steps, most importantly pursuing the biological mechanisms underlying genotype-phenotype associations’.

The most noteworthy section of Shendure’s editorial is the following:

Fourth, although genomics provides a systematic, genome-wide means of identifying a gene or genes in which variation contributes to the pathophysiology of a given disease, understanding the role of these gene(s) inevitably requires experiments. This is ostensibly a task for biologists rather than geneticists; however, geneticists bear some degree of responsibility for ensuring that the story does not end with genetics and, as such, there should be no barriers against geneticists delving deeply into the biology of gene mechanisms. Furthermore, the number of genes implicated by genetic approaches in human phenotypes but whose biological function remains poorly understood is easily in the thousands. The armamentarium of genomic approaches for observational (for example, transcriptional profiling) and perturbational (for example, genome-wide knockdown or knockout screens) experiments may represent useful approaches for advancing our fundamental understanding of the biological role(s) of implicated genes in a scalable fashion.

Interpretation: enough of that ‘let us statisticians keep finding potential disease genes, while you biologists suffer in the lab’ crap. Go to the lab and show that your predictions are anything worth paying attention to.

Needless to say that the answer of the other side will be to redefine the word phenotype and continue wasting money in worthless large-scale studies (check – ‘Tragedy of the Day – New ‘Revealing’ Human Transcriptome Paper at Biorxiv’).

## The Most Challenging Problem in Bioinformatics

In chapter 6 of his book, Darwin wrote –

If it could be demonstrated that any complex organ existed, which could not possibly have been formed by numerous, successive, slight modifications, my theory would absolutely break down. But I can find out no such case.

Although Darwin mentioned complex organs, the statement would be equally valid for complex unicellular organisms not widely known in his time. The genome sequencing projects over the last 15 years shows one case that ‘absolutely breaks down’ the idea of numerous, successive, slight modifications to get to complexity, and that is the origin of eukaryotic organisms from prokaryotes. Eugene Koonin describes the problem as ‘enimagtic’. Dan Graur calls it ‘one of the hardest and most interesting puzzles’. The difficulty is being more and more appreciated with the availability of many different protozoic genome.

What is the challenge? It appears that a large number of eukaryotic genes came from ‘nowhere’ (check this 2002 paper – The origin of the eukaryotic cell: A genomic investigation). Moreover, every possible scenario of explaining their gradual origin over evolutionary time hits one or other roadblock. It appears that eukaryotes have a number of novel genomic toolkits and the entire package came into earliest eykaryotic cell (‘LECA’) together.

In our blog on evolution, we are collecting a number of relevant papers with short explanations about various potential conjectures, but you can see that the question is far from answered.

KOONIN: THE ORIGIN AND EARLY EVOLUTION OF EUKARYOTES IN THE LIGHT OF PHYLOGENOMICS

WHY ARE BACTERIA DIFFERENT FROM EUKARYOTES?

WHY ARE BACTERIA DIFFERENT FROM EUKARYOTES? – (II)

Can the bioinformaticians stick to developing tools and leave answering fundamental problems to others? That results in a large amount of duplicate work, and a good example in this context is the CEGMA paper. The authors did not acknowledge in their paper that the same question was answered by evolutionary biologists for many years prior to their work, and neither did they bother to ask the fundamental questions answered in the prior papers. Two noteworthy examples of uncited prior work –

The origin of the eukaryotic cell: a genomic investigation

A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes

## Eugene Myers on the Future Direction of Bioinformatics

Now that it has become rather inexpensive to sequence and assemble genomes, transcriptomes and metagenomes, the question is what to do with this advancement in technology. The conventional answer is to sequence as much as you can in various 1k, 10k and 100k genome/transcriptome/x-ome project fashion and ask questions later. This mentality led to abundance of GWAS (Genome Wide Association Studies) to determine ‘causal variants’ for all kinds of human traits. The approach failed despite inflated claims and media splash about dramatic discoveries, and the big claims were rightly criticized by knowledgeable scientists. For example, check the following two blog posts by Ken Weiss and Dan Graur –

The height of folly: are the causes of stature numerous but at least ‘finite’?

DisGWASted Again: Another #GWAS, Another Diddly Squat Accompanied by Loud Fanfare

Despite criticisms, the backers of GWAS and ‘more sequencing’ continue to claim – (i) the failure is due to small sample size, whereas their efforts will be successful with even more data, (ii) there is no alternative.

In a series of three posts (here, here and here), we showed that both claims were wrong. Borrowing from a paper by Sydney Brenner, we argued that GWAS and systems biology type of methods would not work due to flaws in fundamental scientific reasoning, whereas the alternative is to make cells, and not the genome, the center of attention. We also proposed a set of fundamental problems in biology and urged bioinformaticians to focus on solving those problems instead of building new tools for better genome assembly.

Speaking of developing genome assembly algorithms and other bioinformatics techniques, if one person had always been way ahead of time, that was Eugene Myers. In most core algorithm related to sequence alignment or assembly, you will find his significant contribution and he made those contributions in early 1990s ! He was the brain behind the company Celera, which was the first to sequence a human genome.

We sought opinion from Eugene Myers about our (and Brenner’s) suggestion about focusing on cells instead of the genome and he kindly replied –

I read the article and I pretty much concur. I work at an institute trying to understand development in terms of what *cells* can do, and I am dedicated to analyzing images and movies of what things encoded in the genome do in cells.

Let us add a bit more information to explain the above comment. Most bioinformaticians noticed that Myers took a break from sequence-related work since 2006-2007, right around the time when NGS took off. The conventional wisdom would say that he probably missed the trend, but those knowing his past record finds that he was again building the future way ahead of others. In 2009, his group worked with Stuart Kim to publish this interesting paper on C. elegans. This work brought forward Sydney Brenner’s work on C. elegans to the high-throughput age, and Myers’ group added the missing piece – computational identification of images of cells.

Analysis of Cell Fate from Single-Cell Gene Expression Profiles in C. elegans

The C. elegans cell lineage provides a unique opportunity to look at how cell lineage affects patterns of gene expression. We developed an automatic cell lineage analyzer that converts high-resolution images of worms into a data table showing fluorescence expression with single-cell resolution. We generated expression profiles of 93 genes in 363 specific cells from L1 stage larvae and found that cells with identical fates can be formed by different gene regulatory pathways. Molecular signatures identified repeating cell fate modules within the cell lineage and enabled the generation of a molecular differentiation map that reveals points in the cell lineage when developmental fates of daughter cells begin to diverge. These results demonstrate insights that become possible using computational approaches to analyze quantitative expression from many genes in parallel using a digital gene expression atlas.

More details are available here

Exploring Cells & Systems via Image Analysis and Customized Microscopy

We are best known for BLAST and the whole-genome shotgun protocol and assembler — accomplishments in traditional sequence-based bioinformatics. But since 2002 the group has focused almost exclusively on analyzing and extracting information from images obtained by various forms of microscopy. We believe that such data will reveal more about the function of the entities encoded in the genome then any other approach and will eventually become a prevailing paradigm of investigation, like sequence-based discovery is today. The group has even begun to develop its own customized microscopes.

Since 2002 we have worked on the following representative problems among others:
Building a cell model of a C.elegans L1 larvae and software to automatically extract cell-by-cell expression levels in situ.
Building models of the fly nervous system using light microscopy. By stochastically expressing GFP in every individual neuron in this brain, we expect to deliver a detailed view of the highly stereotyped neuronal connectivity of the fly brain.
Tracking and measuring the vibrissae of a mouse while it is performing a cognitive function.
Tracking microtubules and centrosomes during the first few cell divisions of the embryogenesis of C. elegans.
Construction of an ultra-fast, block-face multi-photon microscope with onboard microtome for imaging a mouse brain with 1/2-micron pixels in less than a week.

In broad terms, the computational challenges are to build canonical 3D models of biological systems and map molecular observations onto the model. That is, there is the challenge of registering observations from different animals into a single representative framework, and the challenge of extracting meaningful information in the presence of low SNR and diffraction-limited resolution. The offsetting factor to low SNR and resolution is the presence of very strong prior knowledge about the morphology and dynamics of the entities under observation. Many of our new techniques thus involve what are called template-driven approaches.

We are in essence a technology group. Our aim is to develop microscopes and software that make observations of in situ and in vivo systems that enable our collaborative partners to advance molecular and cellular biology. We have worked closely with Stuart Kim (Stanford), Chris Doe (Oregon), Tony Hyman (MPI-CBG), Karel Svoboda, Gerry Rubin, Jim Truman, and Tzumin Lee (HHMI JFRC) as examples. The figure below illustrates imagery from three of our projects.

The overarching goal of our group is to build optical devices, collect molecular reagents, and develop analysis software to monitor in as much detail as possible the concentration and localization of proteins, transcripts, and other entities of interest within a developing cohort of cells for the purpose of working together with other groups in the Dresden area towards a biophysical understanding of development at the level of cell communication and force generation.

## Personalized Cancer Screening Will Lead to Many False Diagnosis of Blood Cancer

Here is an interesting paper that calls into question many commonly accepted notions about GWAS and ‘personalized medicine’. The authors studied somatic mutations in blood samples of cancer patients not known to have leukemia or lymphoma, and found many of them to carry mutations linked with blood cancer.

——————————————————————

Somatic mutations

It is well-known that our genome is not static, and the cells in the body of every individual accumulate mutations over her lifetime. Ken Weiss often points that out in his blog post in discussing the limitations of genome-based comparative studies for predicting common traits or disease risks. From his recent series –

What if Rev Jenyns had agreed? Part I. Would evolutionary theory be different?

What if Rev Jenyns had agreed? Part II. Would evolutionary theory be different from a population perspective?

What if Rev Jenyns had agreed? Part III. ‘Group’ selection in individuals, too.

But individuals are populations too

Let’s ask something very simple: What is your ‘genotype’? You began life as a single fertilized egg with two instances of human genomes, one inherited from each parent (here, we’ll ignore the slight complication of mitochondrial DNA). Two sets of chromosomes. But that was you then, not as you are now. Now, you’re a mix of countless billions of cells. They’re countless in several ways. First, cells in most of your tissues divide and produce two daughter cells, in processes that continue from fertilization to death. Second, cells die. Third, mutations occur so that each cell division introduces numerous new DNA changes in the daughter cells. These somatic (body cell) mutations don’t pass to the next generation (unless they occur in the germline) but they do affect the cells in which they are found.

But how do we determine your genotype? This is usually done from thousands or millions of cells—say, by sequencing DNA extracted from a blood sample or cheek swab. So what is usually sequenced is an aggregate of millions of instances of each genome segment, among which there is variation. The resulting analysis picks up, essentially, the most common nucleotides at each position. This is what is then called your genotype and the assumption is that it represents your nature, that is, all your cells that in aggregate make you what you are.

In fact, however, you are not just a member of a population of different competing individuals each with their inherited genotypes. In every meaningful sense of the word each person, too, is a i of genomes. A person’s cells live and/or compete with each other in a Darwinian sense, and his/her body and organs and physiology are the net result of this internal variation, in the same sense that there is an average stature or blood pressure among individuals in a population.

[snip]

Is somatic variation important?

An individual is a group, or population of differing cells. In terms of the contribution of genetic variation among those cells, our knowledge is incomplete to say the least. From a given variant’s point of view (and here we ignore the very challenging aspect of environmental effects), there may be some average risk–that is, phenotype among all sampled individuals with that variant in their sequenced genome. But somatically acquired variation will affect that variant’s effects, and generally we don’t yet know how to take that into account, so it represents a source of statistical noise, or variance, around our predictions. If the variant’s risk is 5% does that mean that 5% of carriers are at 100% risk and the rest zero? Or all are at 5% risk? How can we tell? Currently we have little way to tell and I think manifestly even less interest in this problem.

Cancer is a good, long-studied example of the potentially devastating nature of somatic variation, because there is what I’ve called ‘phenotype amplification': a cell that has inherited (from the person’s parents or the cell’s somatic ancestors) a carcinogenic genotype will not in itself be harmful, but it will divide unconstrained so that it becomes noticeable at the level of the organism. Most somatic mutations don’t lead to uncontrolled cell proliferation, but they can be important in more subtle ways that are very hard to assess at present. But we do know something about them.

Evolution is a process of accumulation of variation over time. Sequences acquire new variants by mutations in a way that generates a hierarchical relationship, a tree of sequence variation that reflects the time order of when each variant first arrived. Older variants that are still around are typically more common than newer ones. This is how the individual genomes inherited by members of a population and is part of the reason that a group perspective can be an important but neglected aspect of our desire to relate genotypes to traits, as discussed yesterday. Older variants are more common and easier to find, but are unlikely to be too harmful, or they would not still be here. Rarer variants are very numerous in our huge, recently expanded human population. They can have strong effects but their rarity makes them hard to analyze by our current statistical methods.

However, the same sort of hierarchy occurs during life as somatic mutations arise in different cells at different times in individual people. Mutations arising early in embryonic development are going to be represented in more descendant cells, perhaps even all the cells in some descendant organ system, than recent variants. But because recent variants arise when there are many cells in each organ, the organ may contain a large number of very rare, but collectively important, variants.

The mix of variants, their relative frequencies, and their distribution of resulting effects are thus a population rather than individual phenomenon, both in populations and individuals. Reductionist approaches done well are not ‘wrong’, and tell us what can be told by treating individuals as single genotypes, and enumerating them to find associations. But the reductionist approach is only one way to consider the causal nature of life.

Our society likes to enumerate things and characterize their individual effects. Group selection is controversial in the sense of explaining altruism, and some versions of group selection as an evolutionary theory have well-demonstrated failings. But properly considered, groups are real entities that are important in evolution, and that helps account for the complexity we encounter when we force hyper-reductionistic, individual thinking to the exclusion of group perspectives. The same is true of the group nature of individuals’ genotypes.

The authors of Nature Medicine paper had been looking into somatic mutations in cancer patients for several years. Here is one of their earlier papers on the topic.

Somatic mutations affect key pathways in lung adenocarcinoma

Determining the genetic basis of cancer requires comprehensive analyses of large collections of histopathologically well-classified primary tumours. Here we report the results of a collaborative study to discover somatic mutations in 188 human lung adenocarcinomas. DNA sequencing of 623 genes with known or potential relationships to cancer revealed more than 1,000 somatic mutations across the samples. Our analysis identified 26 genes that are mutated at significantly high frequencies and thus are probably involved in carcinogenesis. The frequently mutated genes include tyrosine kinases, among them the EGFR homologue ERBB4; multiple ephrin receptor genes, notably EPHA3; vascular endothelial growth factor receptor KDR; and NTRK genes. These data provide evidence of somatic mutations in primary lung adenocarcinoma for several tumour suppressor genes involved in other cancers—including NF1, APC, RB1 and ATM—and for sequence changes in PTPRD as well as the frequently deleted gene LRP1B. The observed mutational profiles correlate with clinical features, smoking status and DNA repair defects. These results are reinforced by data integration including single nucleotide polymorphism array and gene expression array. Our findings shed further light on several important signalling pathways involved in lung adenocarcinoma, and suggest new molecular targets for treatment.

———————————————————————————

Personalized Medicine

Given somatic mutations exist, we can always continually sequence the genomes or exomes of various tissues of individuals and detect diseases, right? The new paper suggests that such an approach would lead to getting huge number of false positive cancer diagnosis.

Blood Cancer-Related Mutations Commonly Found in Older Individuals

Significant differences in the presentation of mutations in leukemia- or lymphoma-associated genes were observed between age groups. Mutations were noted in 0.88% of individuals aged 40-49 (n = 341), 0.99% of individuals aged 50-59 (n = 707), 1.83% of individuals aged 60-69 (n = 767), 5.26% of individuals aged 70-79 (n = 475), and 6.06% of individuals aged 80-89 (n = 132).

However, this analysis may be underestimating the proportion of individuals with such mutations, the authors wrote, as they only identified small mutations. There remains a need to identify large structural variations or the insertions and deletions of chunks of genetic material associated with hematologic malignancies.

The authors on the study noted that a clinician must be careful in using blood as a surrogate reference. There is a risk of blood-specific variants being mistaken as germline variants in individuals without hematologic malignancies. Further, “germline alleles in cancer samples could be mistaken as tumor-specific variants when comparing to blood samples. Lastly, connections were made between mosaic PPM1D mutations in lymphocytes and the predisposition to breast and ovarian cancers,” the authors wrote.

The authors stated that testing for mutations associated with leukemia and lymphoma would not predict an individual’s risk of a hematologic malignancy.

“We would not want anyone to think they should be screened for these mutations to understand their risk of leukemia or lymphoma,” Timothy Ley, MD, co-author and professor from Washington University said in a statement. “The ability to understand how mutations in these genes increase a person’s risk of blood cancers is a long way off, and genetic testing would be of no benefit at this time.”

And that is just one kind type of disease. Imagine a 40-year old healthy individual going through similar genetic screening for hundreds of diseases, with each type of disease having 1% of false positive rate (5% for 70 year old). Promoters of personalized genetic testing need to think through these issues before making their methods available at all clinics.

## Cuckoo Filter to Replace Bloom Filter?

@rozovr forwarded the link to a new paper that is worth looking into.

Cuckoo Filter: Practically Better Than Bloom

In many networking systems, Bloom filters are used for high speed set membership tests. They permit a small fraction of false positive answers with very good space efficiency. However, they do not permit deletion of items from the set, and previous attempts to extend “standard” Bloom filters to support deletion all degrade either space or performance. We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.

The name derives from their use of Cuckoo hash scheme. Compared to Bloom filter –

We propose the cuckoo filter, a practical data structure that provides four major advantages.

1. It supports adding and removing items dynamically;

2. It provides higher lookup performance than traditional Bloom filters, even when close to full (e.g., 95% space utilized);

3. It is easier to implement than alternatives such as the quotient filter; and

4. It uses less space than Bloom filters in many practical applications, if the target false positive rate  is less than 3%.

Will it replace Bloom filter based k-mer counting methods? Pall Melsted offers his perspective –

## Tragedy of the Day – New ‘Revealing’ Human Transcriptome Paper at Biorxiv

Someone forwarded me the link to a new human transcriptome paper, and as an author of the first genome-wide human transcriptome paper exactly 10 years back, I decided to check how far the field advanced over the decade. What I found is quite shocking and it is abundantly clear that these ENCODE-clowns have not learned any lesson through their public humiliation.

The paper examines the transcriptomes of ~400 female twin pairs (~800 individuals), throws all kinds of statistical analysis tools and reports absolutely nothing interesting. No new pathways, mechanisms, nothing. Based on that summary, I thought the title would be ‘comparative analysis of transcriptomes of 800 UK twins shows nothing’. Instead the authors chose ‘transcriptome sequencing reveals widespread gene-gene and gene-environment interactions’. Oh, please !!

BS starts right from the abstract. The last statement of the abstract says – ‘We uncover large GxG and GxE effects on gene expression and likely complex phenotypes that currently remain elusive’. We pondered about the later part of the sentence and could find absolutely no meaning whatsoever. Do the authors mean ‘we uncover complex phenotypes that currently remain elusive’? How can you uncover something that you do not uncover anywhere in the paper? The best explanation is that the authors wanted to include the words ‘complex phenotype’, because that is what gets them money these days. Otherwise, they could as well say ‘we uncover likely the hand of God’ and be as speculative.

The main text starts with –

Gene expression is a cellular phenotype that informs about the functional state of the cell. Gene expression by itself is a complex trait that depends on genetic and environmental causes.

In plain English, they are saying ‘gene expression’ is now the new definition of function/phenotype, and with that broadened definition of phenotype, here comes our speculative paper. ENCODE 3 is reborn !! That is far from the original intention of such research projects, where they were supposed to explain real phenotypes, not redefine phenotype. However, as Sydney Brenner explained, the inverse problem of physiology cannot be solved for mathematical reasons. Hence comes the redefinition and hiding behind ‘complex phenotype’ metaphors.

A high point of the paper is to go through a long list of genes, find one or two well known members through cherry-picking process and then assign some kind of meaningless biological story. The authors picked Epstein-Barr virus induced 3 gene EBI2.

One of the top hits in LCLs is the Epstein-Barr virus induced 3 gene (EBI3) (Figure 3). That means that ASE at the EBI3 gene depends on the interaction of cis genetic variants and an environmental factor that, likely, in this case is the transformation process of the B cells with EBV.

By now, we have learned that all GWAS-type papers end with the obligatory statement that the authors could not uncover something, because their sample size was small. That later statement then becomes the first sentence in the follow-up grant application asking for more money to do the study with 4000 twins. This paper is no exception.

We found an example of GxE interaction on gene expression that has been widely described in the literature (Adiponectin) supporting the validity of our approach. However, the limitation on power due to the sample size prevented us to discover specific associations in most other cases.

Not that lack of funding was any obstacle for their work.

Acknowledgements: This work has been funded by the EU FP7 grant EuroBATS (No. 259749) which also supports AAB, AB, MND, AV, TDS. AAB is also supported by a grant from the South-Eastern Norway Health Authority, (No. 2011060). RD is supported by the Wellcome Trust (No. 098051). The Louis-Jeantet Foundation, Swiss National Science Foundation, Eurpean Research Council and the NIH-NIMH GTEx grant supports ETD. TS is an NIHR senior Investigator and holder of an ERC Advanced Principal Investigator award. JBR and HZ are supported by the Canadian Institutes of Health Research, Fonds de Recherche Sante du Quebec, and the Quebec Consortium for Drug Discovery. Most computations were performed at the Vital-IT (http://www.vital-it.ch) Center for high-performance computing of the SIB Swiss Institute of Bioinformatics. The TwinsUK study was funded by the Wellcome Trust; EC FP7(2007-2013) and National Institute of Health Research (NIHR). SNP Genotyping was performed by The Wellcome Trust Sanger Institute and National Eye Institute via NIH/CIDR.

In summary, this paper is so meaningless that we suspect it will be channeled into Nature through Magdalena Skipper.

———————————————————————————————

Speaking of constructive criticisms, could the same research be done in a different way? We note that typical biological problems are solved based on three independent sources of data –

1. Palentology (fossil records), which gives the historical perspectives,

2. Phenotypes observed from cell biology and developmental biology,

3. Genetics, which provides as the informational system.

In case of human biology, family history gives information equivalent to palentology.

So, if a researcher is interested in identical twins, he needs to first identify a set of observables based on either family history or looking at the cells under the microscope, and then see whether she can find genetic data to explain the observation. If they do not have any observable to explain, maybe it is time to move on to the next problem.

Instead these authors start by collecting genetic data without any observable to explain and then go on a fishing experiment. When that fails, they redefine gene expression as a ‘complex phenotype’ and write a boring paper to explain nothing.

## FPGA-accelerated Bioinformatics at #ASHG – Dragen Aligner from Edico Genome

We thank Max Shen for his generous contribution to support our blog.
———————————————————–

We covered several CPU and GPU-based algorithms in “ASHG/GA4GH Special – A Recap of Interesting NGS Algorithms of the Year”, but no FPGA-accelerated one. Readers interested in FPGAs will find the following ASHG abstract interesting. At the bottom, we discuss the pros and cons of using these three hardware alternatives in bioinformatics.

Enhanced fetal aneuploidy detection using hardware accelerated alignment

Session Title: Bioinformatics and Genomic Technology Session Type: Poster
Session Location: Exhibit Hall, Ground Level, Convention Center Session Time: Mon 10:00AM-4:30PM

Program Number: 1675M Presentation Time: Mon, Oct 20, 2014, 2:00PM-3:00PM
Keywords: Bioinformatics and Genomic Technology, KW008 – bioinformatics, KW146 – prenatal diagnosis, KW031 – computational tools, KW103 – massively parallel sequencing, KW023 – chromosomal abnormalities

M. Sykes1, C. Roddey2, M. Ruehle2, R. McMillen2, P. Whitley1
1) Sequenom Inc., San Diego, CA; 2) Edico Genome Inc., La Jolla, CA.

Noninvasive prenatal testing can be performed by massively parallel sequencing of DNA from maternal plasma. This method has been shown effective in the detection of fetal aneuploidies of chromosomes 13, 18, 21 and the sex chromosomes. Accurate classification of these aneuploidies requires, in part, alignment of sequencing reads to the human genome, calculation of chromosome fractions based on these alignments and calculation of z-scores for each chromosome based on these fractions. The success of these steps relies upon the choice of aligner and algorithm used to determine the chromosome fractions.

Here we present reclassification of a dataset of 1269 samples previously analyzed using bowtie 2 as the aligner. In this study alignments are generated by the DRAGEN processor, a hardware-accelerated sequencing analysis system developed by Edico Genome. We report systematic differences between the two aligners but equivalent performance in terms of chromosome fraction variability and thus chromosome quantification.

Both the bowtie 2 and DRAGEN based analyses successfully identified all known T13, T18 and T21 cases in the dataset. The sensitivity and specificity were both > 99.9% in each classification. At the same time the DRAGEN system provides speed increases of greater than thirty-fold relative to bowtie 2 running with 6 threads on a 3.5 GHz Xeon CPU, allowing a single computer to replace the efforts of a small cluster.

These results demonstrate that the classification algorithm for fetal aneuploidy is robust and resistant to localized changes in the alignment profile. Furthermore the DRAGEN system provides equivalent performance to bowtie 2 with a significant increase in speed.

A Comparison of CPU, GPU and FPGA

CPUs

CPUs are generic processors coming with a small set of hardware-accelerated commands (e.g. ‘add two numbers’, ‘multiply two numbers’, ‘compare two numbers’, ‘jump to a different statement’, etc.). The programmers write code in a high-level language such as C/C++, and then a compiler translated the code into the language that the CPU understands. For example, if someone wants to add five numbers, the compiler will instruct the processor to use ‘add two numbers’ command multiple times.

Advantage – Flexibility in coding. If the biological problem changes from the original one, the software code can be tweaked slightly to solve an entirely new problem. The compiler takes care of making sure the tweaked software code is understood by the hardware.

Disadvantage – Scalability related to memory bandwidth.

The Future of Computers – Multicore and the Memory Wall

The scalability is not always inherent in the architecture, but comes at times from programmers lack of knowledge about the hardware architecture. After all, the promise of software programming is to make the programmers remain oblivious to what is going inside the chip. However, some programmers do know quite a bit about the chip, and use their knowledge to get around the memory bandwidth issue. Here is a beautiful example from Gene Myers –

In DALIGN Paper, Gene Myers Delivers a Major Blow to His Biggest Competitor

GPUs

The computer display is one specific component, where the issue of processing speed had been most critical. Users were very sensitive to slow displays and were always ready to pay top dollars for a screen that showed the images quickly. GPUs were born from that demand. They are highly parallelized processing chips, except that during most of their early history, GPUs were only used for processing graphics-related instructions.

Advantage Better scalability with large number of parallel executions than CPUs.

(i) All instructions executing at a time have to be of the same type.

(ii) It is important to have some knowledge of the hardware. The companies are trying to remove this restriction by coming up with open-CL and other GPU-specific ‘compilers’, but it is not possible to take full advantage of the hardware without knowing how it works.

FPGA

FPGAs are hardware chips, which can be reprogrammed to solve any specific problem.

Disadvantage Lack of flexibility and the cost associated with this lack of flexibility.

Historically, the bioinformatics land had been the graveyard of dead FPGA companies.

There are two problems –
(i) The algorithmic landscape change too fast,
(ii) The users want to see output from specific programs (‘give me results matching BLAST or HMMER’) rather than correct results following the same algorithm.

Getting back to the current abstract, more details about the technology can be found here. Based on their numbers, cost for each accelerated server and associated expenses is around $3M for over 4 years. Are we doing the calculation right? The extra computing cost of each genome is$1130-$1000=$130 (see below). For 72,000 genomes, the total computing cost is $9.36M using CPU-based technology (i.e. 50 Dell servers). With Dragen, they claim the cost savings to be$6M, or the total cost of their server = 3.36M. Earlier this year Illumina announced their HiSeq X Ten, which is a cluster of 10 HiSeq X instruments capable of sequencing up to 18,000 whole human genomes each year with continuous operation. The HiSeq X Ten has a run cycle of ~3 days and produces ~150 genomes each run cycle. This means that the data generated must be analyzed within 3 days or a backlog will occur. Simple math thus provides a target of ~28 minutes for the completion of the mapping, aligning, sorting, deduplication and variant calling of each genome. Running the industry standard BWA+GATK analysis pipeline to perform this analysis on a reasonably high-end (Dual Intel Xeon E5-2697v2 CPU – 12 core, 2.7 GHz with 96 GB DRAM) compute server takes ~24 hours per genome. To achieve the required throughput of 150 genomes every three days, at least 50 of these servers are required. With DRAGEN’s run time for the full pipeline being well under 28 minutes, only a single card is required to handle the data produced by a full HiSeq X Ten system running at maximum capacity! When calculating the costs associated with the legacy compute server infrastructure required to analyze and store the data produced by the HiSeq X Ten, we use the same method that Illumina themselves use to calculate their “1,000 Genome.” Equipment costs are amortized over 4 years and staff and overheads are included.

By taking into account standard pricing from Dell for the 50 compute servers, industry standard costs for associated IT professionals, rack space and power, as well as AWS pricing for upload and storage of the genomic data over a 4 year period, the $1,000 Genome becomes a$1,130 Genome.

By comparison, DRAGEN significantly reduces the compute cost over the same period while also removing the need for IT professionals and lowering power and rack space requirements. In addition, since DRAGEN implements CRAM-based compression within its pipeline to transparently compress the large amounts of data to be uploaded and stored for later re-analysis, these costs are also significantly lower. When combining all of these savings over the 72,000 genomes sequenced during the 4 year period, DRAGEN provides cost savings of ~\$6 million!

## Renowned Human Geneticist Michael Hammer Shamed, Joe Pickrell Continues to Support Bad Science

In a world of universal deceit, telling the truth is a revolutionary act – George Orwell

The same spirit applies to the world of population genetics. Needless to say, nobody likes a revolution and the least among which is Joe Pickrell.

We decided to check what the hoopla is about and found a rebuttal paper from Eran Elhaik, Dan Graur and collaborators newly posted in arxiv. We have covered previous episodes of this saga in some detail several months back, and the readers can check “Dan Graur’s Latest Bombshell and Implications for Jewish Genetics”. Briefly, Michael Hammer’s group published a paper in 2013 (Mendez et al.), where they claimed to have rewritten the text book on human evolution. Then, Elhaik et al. redid the analysis and found it to be full of incorrect assumptions.

The most interesting part of the latest rebuttal is in its last section, where an author (Thomas Krahn) of the original Mendez et al. paper acknowledged by email that he and other authors were well aware that the newsworthy claim of the paper could be an artifact of incorrect assumptions made in the analysis. What else do you need?

To learn what Dr. Krahn thinks about the time estimates in Mendez et al. (2013), the missing exchange is published below with Dr. Thomas Krahn’s kind permission.

TK: “While I agree that the outrageous time estimates for A00 from Fernando [Mendez] need an urgent correction and discussion, I don’t think that your argumentation yields enough weight to question the fact that Perry’s Y does in fact represent an ‘extremely ancient’ haplogroup.”

EE: “I am just a bit skeptical about some of their statements in the paper, that the A00 predates human and the calculation of the Y tree in their Figure 1, that doesn’t sound right.”

TK: “Yep, we were extensively discussing this issue. My numbers were more like [your] 200ky for the A00 split but Fernando [Mendez] insisted using autosomal SNP frequency data for dating, which I thought is absolute nonsense for the Y mutation mechanism. Anyhow, it was his job to do the dating maths.”

Lately it has become extremely difficult to correct scientific records, and we can write volumes on the topic based on our experience with the junk science promoted by the positivity lady (check Tragedy of the Day: PNAS Got Duped by Positivity Lady !!, Our New PNAS Paper Debunks the Genomics of Positivity). Her positivity ‘science’ is now part of Arianna Huffington’s latest book available in every bookstore and public library, even though we have shown that the entire paper is garbage. People like Joe Pickrell pose particular challenge to those trying to highlight these examples of extremely bad science, and we have experienced similar resistance with our rebuttal paper of positivity paper from others in the field of psychology. In case of Mendez et al., when the original rebuttal paper of Elhaik et al. came out in January, Joe Pickrell was not happy. He is still not happy with Graur and Elhaik, even though it turns out that the authors of the original paper deliberately tweaked their analysis. This must be the post-modern way of finding truth.

There is no crime, absolutely none, that cannot be condoned when ‘our’ side commits it. – George Orwell

——————————————————————————————————————–
Readers may also find the blog post by Elhaik on the arxiv paper helpful.

The flabbergasting science of Fernando Mendez, Michael Hammer, and co-authors

When I published The missing link of Jewish European Ancestry1, currently ranked #2 most-read paper in GBE, Michael Hammer called me:

“outlier folks… who have a minority view that’s not supported scientifically. I think the arguments they make are pretty weak and stretching what we know.”

Since Hammer and I consider science to be two different things, I took it as a compliment. Little did I know back then that my study rocked Hammer’s world, because at the very same year he and Fernando Mendez published their own outlier paper2 dating the most ancient Y chromosome (A00) to over 338,000 years ago, almost twice the age of Anatomically Modern humans according to fossil evidence. Mendez et al.2 were not coy at all in their most abnormal finding and proudly announced that:

“In contrast to the estimates of all previous studies, our estimate of the Y chromosome TMRCA predates both that of known mtDNA lineages and of the likely time of origin of AMHs on the basis of current fossil evidence.”2

Upon reading their paper, it became obvious to us that Hammer missed (once again) the whole point of doing science. For us, science is a systematic approach that produces accurate and replicable results to explain the universe rather than falsify them. In our critique we have shown3

“that consistently throughout their examination, Mendez et al2 have chosen the assumptions, approximations, numerical miscalculations and data manipulation that inflated the final TMRCA estimate.”

This may not come as a complete shock to the reads of this blog and those of us intimately familiar with Hammer’s previous studies on the Jewish people and particularly the Kohanim4-6, which were challenged by myself and other authors7,8.

As often happens, our critique was very well received and got a media coverage that outshined the original fabricated piece. Even The Holy See commented on that through his speaker, which was quiet surprising because he rarely comments on scientific papers (the comment was primarily meant to negate FOX’s allegation as if we had claimed that we dated the biblical “Adam.” FOX news are somewhat illiterate when it comes to science). A short video we published on YouTube was viewed almost 10,000 times in three days and altogether ~15,000 times! As far as we know, this is the first video ever produced to explain a rather dense scientific paper in plain terms. You can watch it below:

With their mischiefs now becoming public knowledge, it was not unexpected that Hammer and Mendez would response. Amazingly, they remained determined to defend their right to falsify the results even after 55% of their co-authors2 abandoned their sinking flotilla. Again, to those of us familiar with their previous work, this came as no surprise.

What did surprise us was that they also decided to publish “all” of our personal correspondence without permission while neglecting some of the most interesting emails that revealed their practices. Unfortunately, the journal censored these and other parts of our reply.

Mendez et al. (2014) rebuttal along with our vasectomized response now appear in EJHG, but since Fernando Mendez could not restrain himself and published a rebuttal online, we published our original and complete response in arXiv, including the forbidden correspondences, with permission by their author, of course. Among else, you may find the following correspondence between myself (EE) and the second co-author of the first paper2 who is also the original discoverer of the A00 haplotype, Thomas Krahn (TK) back in May 2013:

TK: “While I agree that the outrageous time estimates for A00 from Fernando [Mendez] need an urgent correction and discussion, I don’t think that your argumentation yields enough weight to question the fact that Perry’s Y does in fact represent an ‘extremely ancient’ haplogroup.”

EE: “I am just a bit skeptical about some of their statements in the paper, that the A00 predates human and the calculation of the Y tree in their Figure 1, that doesn’t sound right.”

TK: “Yep, we were extensively discussing this issue. My numbers were more like [your] 200ky for the A00 split but Fernando [Mendez] insisted using autosomal SNP frequency data for dating, which I thought is absolute nonsense for the Y mutation mechanism. Anyhow, it was his job to do the dating maths.”

The readers are invited to guess how many of co-authors would remain on Mendez’s shipwreck for their next duly rebuttal. My guess is that their next paper will not be cited as “Mendez et al.” anymore.

REFERENCES

1 Elhaik, E. The Missing Link of Jewish European Ancestry: Contrasting the Rhineland and the Khazarian Hypotheses. Genome Biology and Evolution 5, 61-74, doi:10.1093/gbe/evs119 (2013).

2 Mendez, F. L. et al. An african american paternal lineage adds an extremely ancient root to the human y chromosome phylogenetic tree. Am. J. Hum. Genet. 92, 454-459, doi:S0002-9297(13)00073-6 [pii]

10.1016/j.ajhg.2013.02.002 (2013).

3 Elhaik, E., Tatarinova, T. V., Klyosov, A. A. & Graur, D. The ‘extremely ancient’ chromosome that isn’t: a forensic bioinformatic investigation of Albert Perry’s X-degenerate portion of the Y chromosome. Eur. J. Hum. Genet., doi:10.1038/ejhg.2013.303 (2014).

4 Behar, D. M. et al. The genome-wide structure of the Jewish people. Nature 466, 238-242, doi:nature09103 [pii]

10.1038/nature09103 (2010).

5 Skorecki, K. et al. Y chromosomes of Jewish priests. Nature 385, 32, doi:10.1038/385032a0 (1997).

6 Hammer, M. F. et al. Extended Y chromosome haplotypes resolve multiple and unique lineages of the Jewish priesthood. Hum. Genet. 126, 707-717, doi:10.1007/s00439-009-0727-5 (2009).

7 Tofanelli, S. et al. J1-M267 Y lineage marks climate-driven pre-historical human displacements. Eur. J. Hum. Genet. 17, 1520-1524, doi:10.1038/ejhg.2009.58 (2009).

8 Klyosov, A. A. A comment on the paper: Extended Y chromosome haplotypes resolve multiple and unique lineages of the Jewish Priesthood by M.F. Hammer, D.M. Behar, T.M. Karafet, F.L. Mendez, B. Hallmark, T. Erez, L.A. Zhivotovsky, S. Rosset, K. Skorecki, Hum Genet, published online 8 August 2009. Hum. Genet. 126, 719-724; author reply 725-716, doi:10.1007/s00439-009-0739-1 (2009).

## Nanopore Updates from David Eccles

——————————————————
His first comment on 9/9/2014

As a participant of the MinION access programme (MAP), I have agreed to not report on technical information on the MinION sequencer (and specifically the control lambda DNA samples sent to us by Oxford Nanopore) until our laboratory has decided that the results we’re getting from the device on our own data are acceptable to us. That’s the main reason there’s such a dearth of information at the moment, and the main reason why it’s difficult for me to discuss the details of sequencing.

Invitations to the programme were given in February 2014 (i.e. this year), and the registration process (including paying money and validating that software was working okay) was complete for most people by May 2014. While we were initially told this would be a 6-week process, it has stretched out to be much longer than that, and it is only now that the second wave of participants are coming into the programme.

I expect that we were one of the last labs in the first wave of participants to receive our first sample kits, and we’ve had issues with sample preparation such that we haven’t yet had an acceptable run with our own data (mouse mitochondria). Our lab in New Zealand had shipping issues due to various reasons, and we weren’t able to do our first run until around mid July 2014.

Since then, there have been only a couple of dribbles of “data” from people who have self-certified and consider the MinION good enough for their uses. The most recent that I’m aware of is a video recording that Nick Loman made of a 2-hour sequencing run. You can see the final read length histogram here:

The output files from that run has yet to be released, but it is possible to see from the video the following statistics:

[numbers are estimated from the event data produced by the software, and the number of mappable bases will be less than this]

About 45 Mbp of sequence were generated over the course of the run. The device produced a little under one read per second, so the mean [estimated] read length is about 7kbp. To reduce confusion slightly, the histogram shown in the MinKNOW program here is adjusted so that the bin height displays the summed length of sequences in each bin, rather than the number of sequences. As an example of interpretation for this run, sequence data was mostly generated from sequences that were about 20kbp long.

I’m happy to answer questions about this, but be aware that I am a little bit restricted in that, and can only report on *technical* information that has been produced and release by other self-certified labs.

As an aside, what seems to be missed from much of this discussion is just how different this technology is. I consider it a disruptive technology, and as such, most of the applications for the MinION sequencer have yet to be discovered. Here’s my rough summary of the human-visible parts:

The MinION is a small disposable device about the size of a muesli bar (or granola bar) that plugs into the USB port of a computer, with power for sequencing and fluidics provided by the computer and your pipette. Once samples are loaded onto the device, it will produce a continuous stream of electrical sensor data in real-time that is processed on the computer to identify DNA bases that are moving through each of the 512 pores. Based on a couple of published/released graphs that I’ve seen (from Nick Loman’s lab), the sequencer is producing 10-25M sequence events per hour. These events are converted by base calling software (currently uploaded to and downloaded from a cloud service) into base-space sequence, resulting in a near real-time production of generated sequence.

Unlike other DNA sequencers, there is no DNA amplification required to run samples on the MinION. Sample preparation is reduced to mixing by inversion, spinning down liquids using a fist-sized microcentrifuge, pipetting, and room-temperature incubation to carry out DNA processing reactions — all stuff that could be done in an hour or so in the back of a Jeep.

Unlike other DNA sequencers, the MinION carries out sequencing by directly observing the DNA bases as they move through the detection pore; the sequencing process is observing the exact template strand of interest, rather than a synthesised copy. An electrical current is passed across the pore, and this changes depending on what type of base is moving through the pore at the time. The MinION produces a real-time trace of electrical signals for each pore, allowing the run to be stopped as soon as an interesting sequence is found, then re-loaded with a new sample and restarted.

Unlike other DNA sequencers, there is no limit to the length of sequences that are fed into the MinION [although sequences that are too long can get broken up by pipetting, so the recommended upper limit under standard use is about 50kb]. Instead of fragmenting DNA into hundreds or thousands of pieces prior to sequencing, much longer sequences can be used, allowing sequencing across regions of duplication or low complexity.

——————————————————

> The technology has many similarities with Pacbio except that the readout here is electrical instead of optical and the sequencing instrument is tiny rather than huge.

I wouldn’t discount the actual sequencing process. To reiterate one of my previous points, “the MinION carries out sequencing by directly observing the DNA bases as they move through the detection pore.” PacBio uses fluorescently-tagged nucleotides for sequencing, and requires strand synthesis to occur in order to determine sequence — it’s still sequencing by synthesis. That means you can’t tell the precise nature of the template strand because you have to plan in advance what to expect. This is not a particularly interesting distinguishing feature at the moment, but I expect it to become more useful in the future as others determine how different base modifications (or different bases) influence the pore signal.

Ignoring that, The “signal” part of the technologies comes down to another optical vs electrical fight (like we’ve already seen with IonTorrent vs MiSeq). All optical methods (so far) are a bit more complicated because they require modified fluorescently-tagged (or radiolabelled) nucleotides in order to carry out sequencing.

> So, the obvious question would be how the signal quality of electrical readout is better than the optical readout that PacBio has.

This is, obviously, delving into technical details. At the moment all I feel comfortable with saying is that the quality of what Oxford Nanopore are able to produce internally is higher than what is being observed in the field. This is expected, and the whole point of the MAP — Oxford Nanopore want to work out what needs to be done in order to get others to see the kind of results that they are seeing in-house. If you’re interested in getting more detailed answers (or at least interested in asking similar questions), I suggest you attend or review the Google hangout that’s happening in about 12 hours (not wonderful timing for you, unfortunately):

Clive Brown is very aware of your post, and is likely to try to address at least some of the point you have raised in that hangout.

——————————————————

Doing a kmer analysis is a good idea… I’ll have to add that to my standard workflow. However it doesn’t quite tell the full story about the error profile. Because the nanopore base calling is event/time based in an asynchronous fashion (like PacBio), it’s very prone to extra events and skipped events (i.e. INDELs).

Illumina has a much greater advantage here because they use reversible terminators for sequencing, which usually means that it’s one base sequenced per cycle for all spots on the flow cell. When the “usually” fails (i.e. failing to add a base, or adding too many bases), you get phasing errors and a drop in quality (hence the upper limit on sequencing length due to the accumulation of phasing errors).

Nanopore doesn’t have issues with length — the fifth base call is just as good (or bad) as the fifty thousandth, but you get less reliability because the template doesn’t progress through the pore at a constant speed. A combination of better base calling (i.e. software) and better motor enzymes (i.e. wetware) might reduce those errors in the future, but I expect they’ll never go away entirely.

——————————————————

Okay, so now that we’re self-certified, here’s my take on it. Here’s a base call frequency plot:

Each base call at each location is plotted in a different colour, and the reference base is circled. It demonstrates what ~25% base-call error looks like when the alternate error is fairly random — errors for most non-reference bases are sitting at around the 5% level. There is a little deletion bias, but for SNPs the genotyping already works pretty well.

What I find very interesting is that almost all of this error is a software problem, and you can see quite clearly where the base-calling algorithms are lacking by generating raw trace data together with your run. In other words, the flow cell is great, the chemistry is great, the default sampling resolution is sufficient, but the base calling and event detection algorithms are a bit too immature to identify bases with high reliability.

I’m still cleaning up my plots for that, but I should have something to show in a week or so.

You may also be interested in these probability tables (from a different run than that pictured, but similar results):

Reference Base Probability [sequenced base is the left-most column]
A C G T
A 0.95 0.022 0.014 0.014
C 0.055 0.877 0.021 0.048
G 0.097 0.069 0.743 0.09
T 0.02 0.035 0.016 0.929
i 0.366 0.239 0.116 0.279
d 0.342 0.287 0.124 0.247

Sequenced Base Probability [reference base is the left-most column]
A C G T i d
A 0.844 0.034 0.038 0.015 0.019 0.049
C 0.029 0.814 0.04 0.038 0.019 0.06
G 0.035 0.037 0.826 0.034 0.017 0.051
T 0.015 0.036 0.043 0.846 0.018 0.043

——————————————————

While I’m still working out the best way to display the raw signal data, I’ve finally been able to bundle it up into something that should be suitable for graphing. Due to file sizes, I currently only hava a file containing the event data + signal data for the first 6 reads from one of our experiments, but it should be good enough to get a feel for the data:

https://mega.co.nz/#F!zZAiWR7A!gF86n5f2YAHmaLsQTnMswA

The data is stored as an R data object (load using the ‘load’ function of R), and contains the following information:

* mtDNA reference sequence
* base-called sequences
* read type (template / complement / 2D)
* channel# / read# / file#
* event file name
* scaling information for adjusting between raw signal and event data
* event data
* raw signal ranges corresponding to the event data
* raw signal

I have excluded my mapping from base-called to reference sequence, because mapping is a very subjective thing.

## ASHG/GA4GH Special – A Recap of Interesting NGS Algorithms of the Year

If you find our commentaries useful, please feel free to press the donate button on the right sidebar and your name will be included in the table at the bottom of the page. We thank Pawel Osipowski, a PhD student from Poland, for making today’s generous contribution to support our blog.

Section 1 covers algorithms for short reads (primarily Illumina). The reads are clean and the main problem is the volume of data. The size of each library can be 50-100G and imagine doing that for 10,000 human genomes together. The categories for challenging problems are – base-calling, kmer counting (the most basic step), de Bruijn graph construction, genome assembly, BWT construction of large library, compression of large libraries, searching through many such genomes and expression analysis.

Section 2 covers long noisy reads primarily from PacBio, but should also be equally applicable for nanopore data. The challenge here is how to take care of the noise in the reads.

Section 3 covers GPU acceleration, which is one of the ways to go over the data analysis bottleneck.

Section 4 asks the question of whether it is really necessary to keep scaling up from 10K genomes to 100K genomes and so on, or can we answer biological questions by trying a different route.

Section 5 mentions two new blogs started this year by skilled bioinformaticians.

Section 6 adds a number of interesting papers missed in the first cut. We went through all issues of arxiv/biorxiv and tried to find every interesting paper.

Enjoy and feel free to comment, distribute, criticize, flame us or whatever !!

————————————————————
————————————————————

KMC – Minimizer-based Kmer Counting Algorithm for Short Reads

KMC 2: Fast and resource-frugal k-mer counting

Motivation: Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.

Results: We present a novel method for k-mer counting, on large datasets at least twice faster than the strongest competitors (Jellyfish~2, KMC~1), using about 12\,GB (or less) of RAM memory. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k,x)-mers allows to significantly reduce the I/O, and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC~2 allows to count the 28-mers of a human reads collection with 44-fold coverage (106\,GB of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with an SSD.

We also discussed another efficient kmer counting program – scTurtle – in our blog.

————————————————————
BCALM – Chikhi et al.’s Algorithm for dBG Construction using Minimizer

De Novo Assembly of Human Genome with Only 1.5 GB RAM

The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers proposed a navigational data structure approach in order to improve memory usage. We prove several theoretical space lower bounds to show the limitation of these types of approaches. We further design and implement a general data structure (DBGFM) and demonstrate its use on a human whole-genome dataset, achieving space usage of 1.5 GB and a 46% improvement over previous approaches. As part of DBGFM, we develop the notion of frequency-based minimizers and show how it can be used to enumerate all maximal simple paths of the de Bruijn graph using only 43 MB of memory. Finally, we demonstrate that our approach can be integrated into an existing assembler by modifying the ABySS software to use DBGFM.

————————————————————
Kraken – Minimizer-based Algorithm for Metagenome Classification

Kraken: Ultrafast Metagenomic Sequence Classification Using Exact Alignments

Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.

————————————————————
BWT Construction for Large Short-read Library

Heng Li’s ropebwt2 paper.

Fast construction of FM-index for long sequence reads

We mentioned Heng Li’s ropebwt2 code at github in the past, but now he discloses the algorithm in a brief paper posted at arxiv. We will spend the rest of the day trying to understand what is going on.

Summary: We present a new method to incrementally construct the FM-index for both short and long sequence reads, up to the size of a genome. It is the first algorithm that can build the index while implicitly sorting the sequences in the reverse (complement) lexicographical order without a separate sorting step. The implementation is among the fastest for indexing short reads and the only one that practically works for reads of averaged kilobases in length. Availability and implementation: https://github.com/lh3/ropebwt2 Contact: hengli@broadinstitute.org

Is constructing FM index still relevant in the ‘Minimizer’ world?

Rayan Chikhi worked through an example here that you will find helpful.

Also check –

BWT Construction Parallelized in ‘Parabwt’

BEETL-fastq: a searchable compressed archive for DNA reads

FASTQ is a standard file format for DNA sequencing data which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Text compression tools such as gzip are often employed to reduce the storage burden, but have the disadvantage that the data must be decompressed before it can be used.

Here we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip, but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input.
Results:
We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, from where we can search for 1, 10, 100, 1000, a million of 30-mers in respectively 3, 8, 14, 45 and 567 seconds plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and “in silico pull-down” experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization.
Availability:
BEETL-fastq is part of the BEETL library, available as a github repository at git@github.com:BEETL/BEETL.git.

————————————————————
Compression of Large Libraries

BARCODE – Fast lossless compression via cascading Bloom filters

Background
Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.

Results
We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.

Conclusions
Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested
compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.

Minimizer Success – Disk-based Genome Sequencing Data Compression

We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Most compressors for such data, usually in FASTQ format, cannot efficiently compress the DNA stream if a given dataset is large, since the redundancy between overlapping reads cannot be easily exploited in the main memory, often a few times smaller than the amount of data to compress. More interesting solutions for this problem are disk-based~\cite{Y2011,CBJR2012}, where the better of these two, from Cox~et al.~\cite{CBJR2012}, is based on the Burrows–Wheeler transform (BWT) and achieves 0.484\,bits per base for a 135.3\,Gbp human genome sequencing collection with 45-fold coverage. Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.376\,bits per base as the compression ratio.

————————————————————
Searching from Many Genomes

David Haussler and colleagues had been working on how to align a new read library against thousands of human genomes. The easy solution of performing the same alignment thousand times does not scale well. Benedict Paten, Adam Novak and David Haussler posted a new paper in arxiv to address this question. This appears like an interesting paper, but we have not read it carefully to understand the innovation beyond what Juni Siren (corrected) and others proposed before.

To support comparative genomics, population genetics, and medical genetics, we propose that a reference genome should come with a scheme for mapping each base in any DNA string to a position in that reference genome. We refer to a collection of one or more reference genomes and a scheme for mapping to their positions as a reference structure. Here we describe the desirable properties of reference structures and give examples. To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.

Readers may also find the following work from Haussler group relevant –

HAL: a Hierarchical Format for Storing and Analyzing Multiple Genome Alignments

————————————————————

When it comes to genome assembly, SPAdes group stands way apart from others by publishing many new developments every year.

SPAdes Group’s Long-awaited Repeat Resolver Paper is Published

dipSpades Beats Haplomerger Hands Down in Diploid Assembly

Also, Pavel Pevzner, who leads this group, published an interesting paper on manifold de Bruijn graph. We are not sure whether it is incorporated in SPAdes yet.

Manifold de Bruijn Graphs – Yu Lin and Pavel A. Pevzner

Genome assembly is usually abstracted as the problem of reconstructing a string from a set of its k-mers. This abstraction naturally leads to the classical de Bruijn graph approach—the key algorithmic technique in genome assembly. While each vertex in this approach is labeled by a string of the fixed length k, the recent genome assembly studies suggest that it would be useful to generalize the notion of the de Bruijn graph to the case when vertices are labeled by strings of variable lengths. Ideally, we would like to choose larger values of k in high-coverage regions to reduce repeat collapsing and smaller values of k in the low-coverage regions to avoid fragmentation of the de Bruijn graph. To address this challenge, the iterative de Bruijn graph assembly (IDBA) approach allows one to increase k at each iterations of the graph construction. We introduce the Manifold de Bruijn (M-Bruijn) graph (that generalizes the concept of the de Bruijn graph) and show that it can provide benefits similar to the IDBA approach in a single iteration that considers the entire range of possible k-mer sizes rather than varies k from one iteration to another.

————————————————————
BESST for Scaffolding During Genome Assembly

BESST is a package for scaffolding genomic assemblies. It contains several modules for e.g. building a “contig graph” from available information, obtaining scaffolds from this graph, and accurate gap size information (based on GapEst http://www.ncbi.nlm.nih.gov/pubmed/22923455). For installation, see docs/INSTALL.txt.

————————————————————
Sailfish for RNAseq Expression Analysis

For RNAseq expression profiling, Sailfish reduced the time of analysis by doing it straight from kmers instead of reads.

Check –

Goodbye RSEM, Sailfish Paper Published

We wrote about very fast Sailfish algorithm, when it came out in the arxiv (DEseq and Sailfish Papers for RNAseq). Also, one of our readers liked the code and we wrote about it as well –

Good C++ Development Practices in Sailfish Code

Now the readers can find the published paper in Nature Biotech. There is no excuse for not using it.

————————————————————
Blindcall for Base-calling

BlindCall: Ultra-fast Base-calling Algorithm Using Blind Deconvolution

Motivation: Base-calling of sequencing data produced by highthroughput
sequencing platforms is a fundamental process in current bioinformatics analysis. However, existing third-party probabilistic or machine-learning methods that significantly improve the accuracy of base-calls on these platforms are impractical for production use due to their computational inefficiency.

Results: We directly formulate base-calling as a blind deconvolution problem and implemented BlindCall as an efficient solver to this inverse problem. BlindCall produced base-calls at accuracy comparable to state-of-the-art probabilistic methods while processing data at rates ten times faster in most cases. The computational complexity of BlindCall scales linearly with read length making it better suited for new long-read sequencing technologies.

Availability and Implementation: BlindCall is implemented as a set of Matlab scripts available for download at http://cbcb.umd.edu/~hcorrada/secgen.

————————————————————-
————————————————————-

DALIGN – Pacbio Aligner

In DALIGN Paper, Gene Myers Delivers a Major Blow to His Biggest Competitor

It is a very neat paper, because Gene Myers showed here a way to speed up alignment algorithms without going through BWT construction. It stands out of the norm from most other NGS alignment algorithms.

After HGAP And SPAdes Comes New PacBio Assembler – MHAP

Hybrid Assembly

Very Efficient Hybrid Assembler for PacBio Data

Reader Chengxi Ye mentioned to us a few months back that he developed a very efficient genome assembler for PacBio data, which did PacBio assembly of human genome in 6 hours on a desktop computer compared to >400K CPU hours using a cluster. For those, who do not know Chengxi Ye, he was the author of SparseAssembler and also developed ultrafast basecaller BlindCall. The algorithm of SparseAssembler is more relevant in the current context, because he leveraged its compression scheme in the new PacBio/hybrid assembler.

————————————————————-
————————————————————-
3. Hardware Acceleration – GPU

This year, Professor T.-W. Lam’s group at HKU published a number of papers on hardware acceleration of bioinformatics algorithms using GPU. Although they are mostly for short reads, we make a separate category here.

BWT Construction using GPU

This new preprint from Chi-Man Liu, Ruibang Luo and Tak-Wah Lam is worth checking out (h/t: @lexnederbragt) !! BWT construction is a major problem for large NGS read files, and as we discussed in “What Problems are Being Solved by the Top NGS Bioinformaticians Today?”, almost every top bioinformatician has a BWT-construction project going on. There are several benefits of turning read files into BWT quickly.

(i) faster compression and easy transfer (Tony Cox’s rearranging of reads is a good example),

(ii) quick search through a large set of reads,

(iii) k-mer counting is an equivalent problem as BWT construction (see – “BWT Construction and K-mer Counting”),

(iv) genome assembly using SGA-type of algorithm.

BALSA for Variant Calling using GPU

This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and an intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 hours to process 50-fold whole genome sequencing (~750 million 100bp paired-end reads), or just 25 minutes for 210-fold whole exome sequencing. BALSA’s speed is rooted at its parallel algorithms to effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses.

MEGAHIT – Metagenome Assembly using GPU

MEGAHIT is a NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252Gbps in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it avoids pre-processing like partitioning and normalization, which might compromise on result integrity. MEGAHIT generates 3 times larger assembly, with longer contig N50 and average contig length than the previous assembly. 55.8% of the reads were aligned to the assembly, which is 4 times higher than the previous. The source code of MEGAHIT is freely available at this https URL under GPLv3 license.

————————————————————–
————————————————————-

4. Future

What should one do with so many efficient algorithms? The conventional answer is to cure human diseases, but we believe the companies are over-investing in this space. Our proposal is presented in the following commentaries.

Bioinformatics at a Crossroad Again – Which Way Next?

Crossroads (ii) – Is It Too Late to Acknowledge that Systems Biology and GWAS Failed?

Crossroads (iii) – a New Direction for Bioinformatics with Twelve Fundamental Problems

To spare you from reading three long posts, here is a summary. We believe the new algorithms will be useful in solving fundamental problems in biology, and that process will eventually lead to curing health or other practical problems. The current attempts are too superficial.

To solve fundamental problems, the biologists and computer scientists need to meet halfway and learn quite a bit about each other’s field. That is the direction in which we like to take our blog from next year. All other suggestions are welcome.

————————————————————–
————————————————————-
5. New blogs

A number of new bioinformatics blogs came online in 2014. Here are a few noteworthy ones.

Gene Myers’ Dazzler Blog

James Knight’s Blog at Yale

If you think we missed any noteworthy contribution (including yours), please feel free to mention in the comment section.

—————————————————————————-

After completing this commentary, we went to biorxiv to see whether we missed any other interesting development. The following papers appear noteworthy. We cannot guarantee that all of them are excellent, and welcome readers’ comment to spare us from reading so many papers.

Lighter: fast and memory-efficient error correction without counting

It is from the same group that developed Jellyfish and Sailfish.

Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.

Compression of short-read sequences using path encoding

Another one from the same group.

Storing, transmitting, and archiving the amount of data produced by next generation sequencing is becoming a significant computational burden. For example, large-scale RNA-seq meta-analyses may now routinely process tens of terabytes of sequence. We present here an approach to biological sequence compression that reduces the difficulty associated with managing the data produced by large-scale transcriptome sequencing. Our approach offers a new direction by sitting between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs — a common task in genome assembly — and context-dependent arithmetic coding. Supporting this method is a system, called a bit tree, to compactly store sets of kmers that is of independent interest. Using these techniques, we are able to encode RNA-seq reads using 3% — 11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than recent competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.

Faster sequence alignment through GPU-accelerated restriction of the seed-and-extend search space

Salzberg and colleagues going into GPU space.

Motivation: In computing pairwise alignments of biological sequences, software implementations employ a variety of heuristics that decrease the computational effort involved in computing potential alignments. A key element in achieving high processing throughput is to identify and prioritize potential alignments where high-scoring mappings can be expected. These tasks involve list-processing operations that can be efficiently performed on GPU hardware. Results: We implemented a read aligner called A21 that exploits GPU-based parallel sort and reduction techniques to restrict the number of locations where potential alignments may be found. When compared with other high-throughput aligners, this approach finds more high-scoring mappings without sacrificing speed or accuracy. A21 running on a single GPU is about 10 times faster than comparable CPU-based tools; it is also faster and more sensitive in comparison with other recent GPU-based aligners.

Algorithms in Stringomics (I): Pattern-Matching against “Stringomes”

Paolo Ferragina usually does innovative work. Worth taking a look.

This paper reports an initial design of new data-structures that generalizes the idea of pattern- matching in stringology, from its traditional usage in an (unstructured) set of strings to the arena of a well-structured family of strings. In particular, the object of interest is a family of strings composed of blocks/classes of highly similar stringlets, and thus mimic a population of genomes made by concatenating haplotype-blocks, further constrained by haplotype-phasing. Such a family of strings, which we dub stringomes, is formalized in terms of a multi-partite directed acyclic graph with a source and a sink. The most interesting property of stringomes is probably the fact that they can be represented efficiently with compression up to their k-th order empirical entropy, while ensuring that the compression does not hinder the pattern-matching counting and reporting queries  either internal to a block or spanning two (or a few constant) adjacent blocks. The solutions proposed here have immediate applications to next-generation sequencing technologies, base-calling, expression profiling, variant-calling, population studies, onco-genomics, cyber security trace analysis and text retrieval.

Using 2k + 2 bubble searches to find SNPs in k-mer graphs

Single Nucleotide Polymorphism (SNP) discovery is an important preliminary for understanding genetic variation. With current sequencing methods we can sample genomes comprehensively. SNPs are found by aligning sequence reads against longer assembled references. De Bruijn graphs are efficient data structures that can deal with the vast amount of data from modern technologies. Recent work has shown that the topology of these graphs captures enough information to allow the detection and characterisation of genetic variants, offering an alternative to alignment-based methods. Such methods rely on depth-first walks of the graph to identify closing bifurcations. These methods are conservative or generate many false-positive results, particularly when traversing highly inter-connected (complex) regions of the graph or in regions of very high coverage. We devised an algorithm that calls SNPs in converted De Bruijn graphs by enumerating 2k + 2 cycles. We evaluated the accuracy of predicted SNPs by comparison with SNP lists from alignment based methods. We tested accuracy of the SNP calling using sequence data from sixteen ecotypes of Arabidopsis thaliana and found that accuracy was high. We found that SNP calling was even across the genome and genomic feature types. Using sequence based attributes of the graph to train a decision tree allowed us to increase accuracy of SNP calls further. Together these results indicate that our algorithm is capable of finding SNPs accurately in complex sub-graphs and potentially comprehensively from whole genome graphs. The source code for a C++ implementation of our algorithm is available under the GNU Public Licence v3 at: https://github.com/redayounsi/2kplus2

KmerStream: Streaming algorithms for k-mer abundance estimation

Paul Melsted.

Motivation: Several applications in bioinformatics, such as genome assemblers and error corrections methods, rely on counting and keeping track of k-mers (substrings of length k). Histograms of k-mer frequencies can give valuable insight into the underlying distribution and indicate the error rate and genome size sampled in the sequencing experiment. Results: We present KmerStream, a streaming algorithm for computing statistics for high throughput sequencing data based on the frequency of k-mers. The algorithm runs in time linear in the size of the input and the space requirement are logarithmic in the size of the input. This very low space requirement allows us to deal with much larger datasets than previously presented algorithms. We derive a simple model that allows us to estimate the error rate of the sequencing experiment, as well as the genome size, using only the aggregate statistics reported by KmerStream and validate the accuracy on sequences from a PhiX control. As an application we show how KmerStream can be used to compute the error rate of a DNA sequencing experiment. We run KmerStream on a set of 2656 whole genome sequenced individuals and compare the error rate to quality values reported by the sequencing equipment. We discover that while the quality values alone are largely reliable as a predictor of error rate, there is considerable variability in the error rates between sequencing runs, even when accounting for reported quality values. Availability: The tool KmerStream is written in C++ and is released under a GPL license. It is freely available at https://github.com/pmelsted/KmerStream

SplitMEM: Graphical pan-genome analysis with suffix skips

Michael Schatz.

Motivation: With the rise of improved sequencing technologies, genomics is expanding from a single reference per species paradigm into a more comprehensive pan-genome approach with multiple individuals represented and analyzed together. One of the most sophisticated data structures for representing an entire population of genomes is a compressed de Bruijn graph. The graph structure can robustly represent simple SNPs to complex structural variations far beyond what can be done from linear sequences alone. As such there is a strong need to develop algorithms that can efficiently construct and analyze these graphs. Results: In this paper we explore the deep topological relationships between the suffix tree and the compressed de Bruijn graph. We introduce a novel O(n log n) time and space algorithm called splitMEM, that directly constructs the compressed de Bruijn graph for a pan-genome of total length n. To achieve this time complexity, we augment the suffix tree with suffix skips, a new construct that allows us to traverse several suffix links in constant time, and use them to efficiently decompose maximal exact matches (MEMs) into the graph nodes. We demonstrate the utility of splitMEM by analyzing the pan- genomes of 9 strains of Bacillus anthracis and 9 strains of Escherichia coli to reveal the properties of their core genomes. Availability: The source code and documentation are available open- source at http://splitmem.sourceforge.net

MUSIC: A Hybrid Computing Environment for Burrows-Wheeler Alignment for Massive Amount of Short Read Sequence Data

High-throughput DNA sequencers are becoming indispensable in our understanding of diseases at molecular level, in marker-assisted selection in agriculture and in microbial genetics research. These sequencing instruments produce enormous amount of data (often terabytes of raw data in a month) that requires efficient analysis, management and interpretation. The commonly used sequencing instrument today produces billions of short reads (upto 150 bases) from each run. The first step in the data analysis step is alignment of these short reads to the reference genome of choice. There are different open source algorithms available for sequence alignment to the reference genome. These tools normally have a high computational overhead, both in terms of number of processors and memory. Here, we propose a hybrid-computing environment called MUSIC (Mapping USIng hybrid Computing) for one of the most popular open source sequence alignment algorithm, BWA, using accelerators that show significant improvement in speed over the serial code.

Alignment-free comparison of next-generation sequencing data using compression-based distance measures

Enormous volumes of short reads data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison.
The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly.
Thus alignment-free methods need to be developed for the comparison of NGS samples of short reads.
Recently, new k-mer based distance measures such as {\it CVTree}, dS2, {\it co-phylog} have been proposed to address this problem.
However, those distances depend considerably on the parameter k, and how to choose the optimal k is not trivial since it may depend on different aspects of the sequence data.
Hence, in this paper we consider an alternative parameter-free approach: compression-based distance measures.
These measures have shown impressive performance on long genome sequences in previous studies, but they have not been tested on NGS short reads.
In this study we perform extensive validation and show that the compression-based distances are highly consistent with those distances obtained from the k-mer based methods, from the alignment-based approach, and from existing benchmarks in the literature.
Moreover, as these measures are parameter-free, no optimization is required and they still perform consistently well on multiple types of sequence data, for different kinds of species and taxonomy levels.
The compression-based distance measures are assembly-free, alignment-free, parameter-free, and thus represent useful tools for the comparison of long genome sequences and NGS samples of short reads.

Linear Recognition of Almost Interval Graphs

Yixin Cao

Let interval+kv, interval+ke, and interval−ke denote the classes of graphs that can be obtained from some interval graph by adding k vertices, adding k edges, and deleting k edges, respectively. When k is small, these graph classes are called almost interval graphs. They are well motivated from computational biology, where the data ought to be represented by an interval graph while we can only expect an almost interval graph for the best. For any fixed k, we give linear-time algorithms for recognizing all these classes, and in the case of membership, our algorithms provide also a specific interval graph as evidence. When k is part of the input, these problems are also known as graph modification problems, all NP-complete. Our results imply that they are fixed-parameter tractable parameterized by k, thereby resolving the long-standing open problem on the parameterized complexity of recognizing interval+ke, first asked by Bodlaender et al. [Bioinformatics, 11:49--57, 1995]. Moreover, our algorithms for recognizing interval+kv and interval−ke run in times O(6k⋅(n+m)) and O(8k⋅(n+m)), (where n and m stand for the numbers of vertices and edges respectively in the input graph,) significantly improving the O(k2k⋅n3m)-time algorithm of Heggernes et al. [STOC 2007] and the O(10k⋅n9)-time algorithm of Cao and Marx [SODA 2014] respectively.

Indexing large genome collections on a PC

Agnieszka Danek, Sebastian Deorowicz, Szymon Grabowski

Motivation: The availability of thousands of invidual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements.
Results: We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching model, against a collection of thousand(s) genomes. Its unique feature is the small index size fitting in a standard computer with 16–32\,GB, or even 8\,GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, the exact matching queries are handled in average time of 39\,μs and with up to 3 mismatches in 373\,μs on the test PC with the index size of 13.4\,GB. For a smaller index, occupying 7.4\,GB in memory, the respective times grow to 76\,μs and 917\,μs.
Availability: Software and Suuplementary material: \url{this http URL}.

SAMBLASTER: fast duplicate marking and structural variant read extraction

Gregory G. Faust, Ira M. Hall

Motivation: Illumina DNA sequencing is now the predominant source of raw genomic data, and data volumes are growing rapidly. Bioinformatic analysis pipelines are having trouble keeping pace. A common bottleneck in such pipelines is the requirement to read, write, sort and compress large BAM files multiple times.
Results: We present SAMBLASTER, a tool that reduces the number of times such costly operations are performed. SAMBLASTER is designed to mark duplicates in read-sorted SAM files as a piped post-pass on DNA aligner output before it is compressed to BAM. In addition, it can simultaneously output into separate files the discordant read-pairs and/or split-read mappings used for structural variant calling. As an alignment post-pass, its own runtime overhead is negligible, while dramatically reducing overall pipeline complexity and runtime. As a stand-alone duplicate marking tool, it performs significantly better than PICARD or SAMBAMBA in terms of both speed and memory usage, while achieving nearly identical results.
Availability: SAMBLASTER is open source C++ code and freely available from this https URL

Constructing String Graphs in External Memory

In this paper we present an efficient external memory algorithm to compute the string graph from a collection of reads, which is a fundamental data representation used for sequence assembly.
Our algorithm builds upon some recent results on lightweight Burrows-Wheeler Transform (BWT) and Longest Common Prefix (LCP) construction providing, as a by-product, an efficient procedure to extend intervals of the BWT that could be of independent interest.
We have implemented our algorithm and compared its efficiency against SGA-the most advanced assembly string graph construction program.