The problem of assembling a large number of noisy long reads is expected to show up, no matter whether one uses Pacbio or nanopore long read technology. The good news is that it is possible to do the assembly and the quality is a lot better than what can be achieved with short reads or even Sanger reads at far higher cost. The latest MHAP paper has excellent demonstration of this point regarding the Drosophila genome.
With that knowledge, let us look at the biggest informatics challenge at hand and that is how to align many read pairs against each other. To visualize the scale of the problem, if one plans to assemble human genome from 5kB reads with 50x coverage, that means he will have to work with about 30 million reads. Aligning those 30 million reads all against all is the most time-consuming step in the assembly. That means 30 million x 30 million = 900 trillion alignments of very noisy reads, where Smith-Waterman is the most viable approach.
How to do 900 trillion alignments of long read pairs? The answer is simple – don’t do it, and that is where the main innovation of BLASR/BWAMEM, DALIGN and MHAP come in. All three methods pick only a subset of read pairs to do the actual alignment, and that subset is selected based on sharing of unusually large number of k-mers between read pairs. The difference between the algorithms is in how they find those shared k-mers. To compare the time-performances of the algorithms, one needs to understand both mathematical aspects of the respective methods and the hardware-related issues, because a large amount of data are being swapped between hard-disk, memory and processor during the course of execution.
Gene Myers’ DALIGN uses the most straightforward method. It takes a subset of reads, finds all kmers along with their locations, sorts them and identifies matching pairs. Sorting is the most time-consuming step here, and Myers uses the speed difference between L1 cache and RAM to build a clever sorting algorithm that keeps data within the L1 cache. That can give one 100x boost in access time over RAM, and moreover the conflict between processors is not an issue, because each processor has its own L1 cache.
BLASR and BWA-MEM use Burrows Wheeler transform and FM index of the collection of reads and then finds shared kmers by doing BWT-based perfect match alignment.
There are three difficulties here compared to DALIGN.
(i) Unlike human genome, where the reference is fixed, the BWT of reads have to be constructed again and again for every read library and that adds to the cost of alignment.
(ii) BWT-based FM index is saved in the RAM and it randomizes the locations compared to the genomic locations. Therefore, the RAM access is effectively random and slows down the process.
(iii) If one has many processors aligning many reads in parallel, memory bandwidth becomes the major bottleneck, because each processor wants to align its own segment with the index in the RAM.
MHAP creates a hash table of kmers in a read and attempts to find other matching ones based on the number of hash collisions. I think that is essentially what their algorithm boils down to after you remove all bells and whistles about MinHash sketch. If my interpretation is correct, maybe one can even get rid of the MinHash sketch and simply use a one-dimensional hash.
Here is the problem. The number of comparisons in their approach will scale as N^2, whereas the scaling will be closer to N or N.log(N) for other two approaches (ignoring the insignificant cost of BWT construction for BWAMEM and BLASR).
Am I missing something here?
The author of MHAP responds:
While being stuck in traffic at the Golden Gate bridge during my last San Francisco trip, an idea occurred to me. How about I do some research work to benchmark all suspension bridges in the world?
A quick check at Wikipedia gave me over one hundred candidates, leading me to restrict my ambitious plan to only the red-colored bridges. That made sense, because the red colored bridges attract unusually high number of tourists and therefore should be an ideal subset representing all traffic-heavy suspension bridges.
After carefully considering all relevant factors, I picked the following bridges -
1. 25 de Abril Bridge, Portugal
2. Xihoumen Bridge, China
3. Golden Gate Bridge, San Francisco, USA
4. Yichang Bridge, China
5. Hirado Bridge, Nagashaki, Japan
Benchmarking Criteria and Funding
For the purpose of my study, I selected the following criteria. I will drive on the above bridges through every available access road on three different days of the week and at three different times during the day – early morning, midday and later in the evening.
I collected some preliminary data on the Golden Gate bridge accordingly and applied for a grant from NIH to do benchmarking. The grant application made my hobby project appear relevant for human health, as you can see from the introduction -
Rapid access to medical care is one of the biggest concerns, and overcrowded bridges and highways can lead to fatality for those patients seeking emergency treatment. This study will investigate the time of access across the busiest suspension bridges of the world.
To make the research project even more relevant for human health, I threw in a few irrelevant and arbitrary observations such as two of the selected bridges have the highest suicide rates. What the heck !!
The project got funded !!
Finally the Paper
After completing my study, I wrote a paper and submitted to the most respected civil engineering journal. All three reviewer agreed about the importance of my study and replied using one letter technical jargon that I did not understand.
Reviewer 1: LOL.
Reviewer 2: LOL.
Reviewer 3: LOL.
My civil engineer friend working on construction projects helped me out by sending me a larger email explanation.
Just like a computer program moves data from hard-disk to memory and back to hard-disk, a bridge moves people around. Before constructing a bridge, we collect extensive amount of field-data, construct a model on expected traffic flow and design the bridge. That design includes various other factors including cost and margin of error that you did not pay attention to.
You can surely drive up and down the bridges a few times and write travel notes about how cold Golden Gate bridge is during mid-summer, but I do not see what your paper will do to the technical aspect of our field.
I disagree, because as a user, I have a right to benchmark the bridges and as a researcher, I have a right to publish those result in a scientific journal. I am wondering whether to pay PLOS One to publish my paper or send it to open access site homolog.us and save $1350.
The above story is fictitious and is not related to growing number of bioinformatics papers trying to ‘benchmark’ various software programs as black boxes without any discussion about the underlying algorithms.
Remember TPP? Here is a good example of what ‘free trade’ will look like after all countries adopt the rules written by multinational corporations.
A Colombian biology student is facing up to 8 years in jail and a fine for sharing a thesis by another scientist on a social network.
Diego Gómez Hoyos posted the 2006 work, about amphibian taxonomy, on Scribd in 2011. An undergraduate at the time, he had hoped that it would help fellow students with their fieldwork. But two years later, in 2013, he was notified that the author of the thesis was suing him for violating copyright laws.
The ‘why’ part is explained in the following paragraph –
But according to prosecutors, the move was criminal. Colombian copyright law was reformed in 2006 to meet the stringent copyright protection requirements of a free trade agreement that the country signed with the United States. Yet while the US has few criminal penalties for copyright infringement, Colombia allows only for a few exceptions.
To learn more about TPP, check -
TPP: One More Attempt to End Internet Freedom after Failed SOPA/PIPA
James Knight (@knightjimr) joined Yale as a research scientist and director of bioinformatics, and also started a blog, where you can find a lot of useful information. For those, who do not know him, he developed the Newbler assembler for 454 reads and also collaborated with Eugene Myers in the 90s, before Myers became world famous (or rather made Craig Venter world famous).
Graphs, Alignments, Variants and Annotations, pt. 1
Note: This post, and following posts, were triggered by the recent post by Heng Li, describing his proposed GFA format for assembly graph data. After a twitter request for comments, my first tweet was that this was the right person to do the format. My second tweet, after reading the post, was that the format “was a miss…” Heng rightly asked me to explain, and this begins that explanation.
In recent years, there has been a growing trend toward developing a standard file format for assemblies, specifically a graph-based format (since all of the assemblers’ internal data structures hold a graph of one sort or another). A number of years ago, it was the AMOS project that made the attempt, two to three years ago, it was FASTG, and now this GFA format proposal.
When the FASTG format was really being talked about at the several conferences two years ago, where by luck or by happenstance most of the lead authors of the major NGS assemblers were present, I came really close to commenting on the format, but refrained. The main reason is that I didn’t see anything wrong with the format itself…it does a very good job capturing the structure that nearly all of the assemblers build, and is a usable format for organizing the assemblies’ data. The thoughts in my mind were all about who was doing the design, and, in a related way, why the FASTA format that we all know and love (or don’t love) is called the FASTA format.
Graphs, Alignments, Variants and Annotations, pt. 2
Instead of heading towards the more theoretic graph design, the day after writing part 1 of what is turning out to be a series, I focused on concrete software changes that might answer the first question I posed in the previous post (“If you recast the problem as that of calling ‘annotated variants’, can you speed up the current pipelines?”), and that is what this post will be about. The more theoretic graph design will be coming, but I’m still thinking through Heng Li’s update describing previous work on graph algorithms, as well as a recent slide deck he presented (the picture of where he is coming from is becoming clearer…).
Last year, we commented that a large number of bioinformatics groups were working on constructing BWT from huge Illumina libraries. The group of Yongchao Liu, Thomas Hankeln and Bertil Schmidt (who previously worked on various GPU algorithms) was not mentioned, but they also joined the fray with their parabwt release.
ParaBWT is a new and practical parallelized Burrows-Wheeler transform (BWT) and suffix array construction algorithm for big genome data, which has a linear space complexity with a small constant factor. The performance of ParaBWT has been evaluated using two sequences generated from two human genome assemblies: the Ensembl Homo sapiens assembly and the human reference genome, on a workstation with two Intel Xeon X5650 hex-core CPUs and 96 GB RAM, running the Ubuntu 12.04 LTS operating system. Our performance comparison to FMDindex and Bwt-disk reveals that on 12 CPU cores, ParaBWT runs up to 2.2 times faster than FMD-index, reducing the runtime from 26.56 hours to 12.34 hours for a sequence of about 60 billion nucleotides, and up to 99.0 times faster than Bwt-disk.
Paper is still forthcoming. Please be satisfied with the sourceforge page for the time being.
Adam Phillippy and collaborators submitted a new paper to biorxiv -
Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing
We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
A few comments -
1. Primary innovation
They use MinHash sketch.
In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997), and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.
A large scale evaluation has been conducted by Google in 2006  to compare the performance of Minhash and Simhash algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling and using Minhash and LSH for Google News personalization.
Here is a figure from the paper, explaining how it works -
2. SPAdes vs MHAP comparison
Using PBcR-MHAP, microbial genomes can be completely assembled from long reads in roughly the same time required to generate incomplete assemblies from short reads. For example, PBcR-MHAP was able to accurately resolve the entire genome of E. coli K12 using 85X of SMRT reads in 4.6 CPU hours, or 20 minutes using a modern 16-core desktop computer. In comparison, the state-of-the-art39 SPAdes assembler40 required 4.1 CPU hours to assemble 85X Illumina reads from the same genome. Both short- and longread assemblies are highly accurate at the nucleotide level (>99.999%), but the short-read assembly is heavily fragmented and contains more structural errors (Supplementary Table S4, Supplementary Fig. S3). Our initial SMRT assembly does contain more single-base insertion/deletion (Indel) errors, but polishing it with Quiver (requiring an additional 6.6 CPU hours) resulted in the lowest number of consensus errors of all assemblies (11 vs. 96 for SPAdes).
3. Assembly cost
Exponentially lower costs have democratized DNA sequencing, but assembling a large genome still requires substantial computing resources. Cloud computing services offer an alternative for researchers that lack access to institutional computing resources. However, the cost of assembling long-read data using cloud computing has been prohibitive. For example, using Amazon Web Services (AWS), the estimated cost to generate the D. melanogaster PBcR-BLASR assembly is over $100,000 at current rates, an order of magnitude higher than the sequencing cost. With MHAP, this cost is drastically reduced to under $300. To expand access to the PBcR-MHAP assembly pipeline, we have provided a free public AWS image as well as supporting documentation for non-expert users that reproduces the D. melanogaster assembly presented here in less than 10 hours using AWS. Allocating additional compute nodes, which would marginally increase costs, could further reduce assembly time. For E. coli, the total cost of PBcR-MHAP assembly and Quiver polishing is currently less than $2. With MHAP, assembly costs are now a small fraction of the sequencing cost for most genomes, making long-read sequencing and assembly more widely accessible.
4. Overall assessment
a) The introduction of MinHash sketch in assembly is very innovative. Also, it is very helpful that they gave full demonstration of their assembly technique for small and large genomes.
b) Aligner comparison -
In addition to speed, MHAP is a highly sensitive overlapper. We evaluated the sensitivity and specificity of MHAP versus BLASR32, the only other aligner currently capable of overlapping SMRT reads. BWA-MEM, SNAP, and RazerS were also evaluated, but their current versions were unable to reliably detect noisy overlaps (Supplementary Note 2).
Not sure, why they overlooped DALIGN. Also, BWA-MEM is tuned to near perfect alignment (k=19) and a small change in parameter will give what they are looking for. So, “their current versions were unable to reliably detect noisy overlaps” can be easily and painlessly corrected by sending a quick email to Heng Li.
c) The cost section is somewhat sloppy and is probably written for marketing department of PacBio, not serious audience. Why is cloud cost of E. coli given with Quiver and Drosophila without Quiver? Also, what can one project from $300 for Drosophila to the cloud cost of assembling human-sized genome using the same method? Does it grow as O(N) or O(N^2) with respect to size? We do not need exact numbers, but the order of growth of assembly time with genome size is something any user will look for.
Edit. A comment from @aphillippy
d) With regard to the general field of bioinformatics, it is indeed great news that assembling complete genomes, which is one of the long-standing problems, is going to be solved satisfactorily. That means bioinformaticians will have to rethink their strategy for future -
Bioinformatics at a Crossroad Again – Which Way Next?
The Biggest Lesson from Microsoft’s Recent Battle with the US Government
A court ruling involving Microsoft’s offshore data storage offers an instructive lesson on the long reach of the US government-and what you can do to mitigate this political risk.
A federal judge recently agreed with the US government that Microsoft must turn over its customer data that it holds offshore if requested in a search warrant. Microsoft had refused because the digital content being requested physically was located on servers in Ireland.
Microsoft said in a statement that “a US prosecutor cannot obtain a US warrant to search someone’s home located in another country, just as another country’s prosecutor cannot obtain a court order in her home country to conduct a search in the United States.”
The judge disagreed. She ruled that it’s a matter of where the control of that data is being exercised, not of where the data is physically located.
This ruling is not at all surprising. It’s long been crystal clear that the US will aggressively claim jurisdiction if the situation in question has even the slightest, vaguest, or most indirect connection. Worse yet, as we’ve seen with the extraterritorial FATCA law, the US is not afraid to impose its own laws on foreign countries.
One of the favorite pretexts for a US connection is the use of the US dollar. The US government claims that just using the US dollar-which nearly every bank in the world does-gives it jurisdiction, even if there were no other connections to the US. It’s quite obviously a flimsy pretext, but it works.
Recently the US government fined (i.e., extorted) over $8 billion from BNP Paribas for doing business with countries it doesn’t like. The transactions were totally legal under EU and French law, but illegal under US law. The US successfully claimed jurisdiction because the transactions were denominated in US dollars-there was no other US connection.
The subject of human genetics does not interest us, but we got curious after noticing a lot of digital ink being spent on Nicholas Wade’s recent book (see here, here and here). In fact, population geneticists do not like the book so much that they decided to write a letter to whoever to protest against it. That goes against the unwritten rule we were taught in physics, where we were not expected to popularize a document that we did not like and let obscurity take over it.
This undue publicity got us interested in other books written by Wade, and we found a gem that is highly worth reading. Wade published “THE NOBEL DUEL: The Scientists’ 21-Year Race to Win the World’s Most Coveted Prize” in 1981. The reviews are fantastic, but where can we find the book?
ANOTHER KIND OF SCIENTIFIC METHOD
”The Nobel Duel” outlines the dark side of science – the sort of rivalry that featured so prominently in James D. Watson’s ”The Double Helix,” but without the humor. The absence of humor undoubtedly stems from the personalities of the two protagonists, who wandered far from the traditional courtesies of their profession to gain what more talented but less driven scientists failed to acquire.
Roger Guillemin and Andrew Schally, the first a French-born physician, the other a blunt Polish refugee who started his research career as a laboratory technician, are scarcely typical scientists. One-time colleagues, they formed two teams that, in their determination to gain the same objective and gain it first, ignored almost all of the scientific niceties.
In a profession that insists on giving fair credit to every scientist who has contributed in any way to a specific finding – ”If I have seen further … it is by standing upon the shoulders of Giants,” in Newton’s phrase – Guillemin and Schally were notably reluctant to acknowledge each other’s efforts. Guillemin consistently omitted Schally’s work from his review articles, and Schally typically shrugged aside a dispute among members of his own team with the comment: ”What did it matter to me? It’s my lab – I got the glory anyway.” Similarly, both men refused to cooperate in the traditional manner of scientific teams embarked on the same research problem. ”Don’t talk to the enemy,” Mr. Schally told his team. ”I would have preferred a more open relationship, in which we could have got on the telephone and saved each other time,” conceded one former member of the Guillemin group. And in an academic world that frowns on secrecy as anathema – or at least more suitable to industrial laboratories – Guillemin and Schally often withheld information from scientific meetings for fear of alerting the opposition to a fruitful advance.
The tenor of the rivalry comes through best in an exchange of letters quoted by Mr. Wade: ” ‘Your somewhat derogatory and deprecating remarks … surprised me as the attack on Pearl Harbor surprised the U.S. Navy,’ Schally complained in a letter written after a conference in which Guillemin had allegedly slighted Schally’s contributions. Guillemin’s reply to the charges was carefully calculated to vex his volatile adversary to the utmost: ‘I have no comments except to say that I am neither your conscience nor your psychiatrist,’ he wrote.”
If their relationship was hardly typical of the world of science, neither was the quest on which the two men embarked. They sought to discover and characterize the hormones that the brain uses to control temperature, reproduction, growth and a host of other body functions. These peptide hormones, we now know, exist in minute amounts in the part of the brain called the hypothalamus. Since the biological residue left by a single fingerprint may contain more material than the hormones from a thousand hypothalami, the search for the elusive hormones became an exercise in large-scale biotechnology. At times, Guillemin and Schally spent more time in slaughterhouses than they did in their laboratories. For one experiment, which in spite of all efforts failed, Guillemin collected half a million sheep hypothalami; Schally, with smaller resources, made do with a paltry 200,000 pig hypothalami at a time.
Any major piece of research in modern biology demands specialization greater than any single scientist can possess, and both men assembled their own teams of chemists, biochemists, physiologists and other experts to make their assault on the hormones, and on the rival group. Indeed, toward the end of the race, both became glorified lab managers rather than active scientists, rounding up funds, suggesting experiments and firing ripostes at their opposite number, but rarely carrying out the grueling repetition of everyday laboratory work.
Guillemin and Schally also threw their energies into lobbying scientists who might help them to land the Nobel Prize. Once they had successfully -and almost simultaneously -succeeded in identifying the obscure hormones, they assiduously courted Rolf Luft, a leading member of the Nobel awards committee. An editorial by Guillemin in the influential New England Journal of Medicine prominently credited Luft’s research, although it had little obvious connection with Guillemin’s. Schally planned to give Luft the honor of performing the clinical trials of one of the chemicals that emerged from his team’s work.
Science writer Wade gets off to a rousing start: the Nobel prize ceremonies in Stockholm in 1977, with Rosalyn Yalow seated between her co-sharers in the physiology-or-medicine prize, Roger Guillemin and Andrew Schally–a proverbial rose between thorns. Guillemin and Schally, chafing at their proximity, glower with the mutual enmity that has been the scandal of the endocrinology world. For years each had tried–racing with the other–to isolate and synthesize the brain chemicals reputed to control the release of pituitary hormones, small proteins with horrendous names like luteinizing hormone releasing factor (LRF), thyroid releasing factor (TRF), etc. The book then flashes back to the two men’s early years–and ends with a coda on what rivalry and competition may or may not do for science. (Yalow was not involved in the rivalry, but her part in developing sensitive assay techniques paid off in identifying the chemicals involved.) Guillemin, a Frenchman, migrated to Texas via Montreal, and now presides over a prestigious lab at the Salk Institute in La Jolla. Schally, a Pole, settled in the New Orleans VA Hospital with colleagues at Tulane. For five years the rivals actually collaborated, existing in a state of muted Cold Wax. Then hostilities broke out with a vengeance. We see Guillemin behaving like a suave Napoleonic commander, missing no opportunity to excoriate Schally with innuendo and irony. Schally, a soldier’s son who also recruited troops, was more blunt, direct, and aggressive. To their credit, both men were captivated with Englishman Geoffrey Harris’ theory that the brain controlled the pituitary gland by releasing chemicals from the hypothalmus, a small cluster of nerve cells linked to the pituitary. Their belief and preliminary work flew in the face of conventional wisdom, but was vindicated over time. Wade details the progress of the rival labs, the final round-the-clock race, and the scandalous secrecy in the supposedly share-the-findings world of science. It is at times an exhilarating suspense tale, at times a poignant reading of some of the second lieutenants or spiritual leaders (like Harris). But in the end one wearies of these humorless, hard-driven megalomaniacs. What they did, ironically, was not brilliant science like Watson-Crick, but the tedious grinding up of thousands of tons of pig or sheep hypothalmi for analysis and purification. Good writing, juicy quotes, but dreary, rather dreadful protagonists.
Your idiocy and complete lack of competency seems to be only surpassed by Stalin!