Merely three months ago, Francis Collins was telling everyone how his organization could have had an Ebola vaccine ready by now, if only Congress had given him a little more money. That claim was derided by many, including Mike Eisen, who suggested in his blog that Collins invest in basic research instead of chasing the latest trend. Now all of that is forgotten. Today, in yet another article published in the medical literature, Francis Collins declared where any new money will be going (hint – personalized precision medicine).
A New Initiative on Precision Medicine
The last paragraph sums it all up –
This initiative will also require new resources; these should not compete with support of existing programs, especially in a difficult fiscal climate. With sufficient resources and a strong, sustained commitment of time, energy, and ingenuity from the scientific, medical, and patient communities, the full potential of precision medicine can ultimately be realized to give everyone the best chance at good health.
Readers should note that Collins has been promising the moon and asking for more money for a long time. Here is an article he published in JAMA in 1999, telling you what 2010 would look like if you gave him more money. None of those promises ever came true.
A HYPOTHETICAL CASE IN 2010
General visions of gene-based medicine in the future are useful, but many health care providers are probably still puzzled by how it will affect the daily practice of medicine in a primary care setting. A hypothetical clinical encounter in 2010 is described here.
John, a 23-year-old college graduate, is referred to his physician because a serum cholesterol level of 255 mg per deciliter was detected in the course of a medical examination required for employment. He is in good health but has smoked one pack of cigarettes per day for six years. Aided by an interactive computer program that takes John’s family history, his physician notes that there is a strong paternal history of myocardial infarction and that John’s father died at the age of 48 years.
To obtain more precise information about his risks of contracting coronary artery disease and other illnesses in the future, John agrees to consider a battery of genetic tests that are available in 2010. After working through an interactive computer program that explains the benefits and risks of such tests, John agrees (and signs informed consent) to undergo 15 genetic tests that provide risk information for illnesses for which preventive strategies are available. He decides against an additional 10 tests involving disorders for which no clinically validated preventive interventions are yet available.
A cheek-swab DNA specimen is sent off for testing, and the results are returned in one week (Table 1). John’s subsequent counseling session with the physician and a genetic nurse specialist focuses on the conditions for which his risk differs substantially (by a factor of more than two) from that of the general population. Like most patients, John is interested in both his relative risk and his absolute risk.
John is pleased to learn that genetic testing does not always give bad news — his risks of contracting prostate cancer and Alzheimer’s disease are reduced, because he carries low-risk variants of the several genes known in 2010 to contribute to these illnesses. But John is sobered by the evidence of his increased risks of contracting coronary artery disease, colon cancer, and lung cancer. Confronted with the reality of his own genetic data, he arrives at that crucial “teachable moment” when a lifelong change in health-related behavior, focused on reducing specific risks, is possible. And there is much to offer. By 2010, the field of pharmacogenomics has blossomed, and a prophylactic drug regimen based on the knowledge of John’s personal genetic data can be precisely prescribed to reduce his cholesterol level and the risk of coronary artery disease to normal levels. His risk of colon cancer can be addressed by beginning a program of annual colonoscopy at the age of 45, which in his situation is a very cost-effective way to avoid colon cancer. His substantial risk of contracting lung cancer provides the key motivation for him to join a support group of persons at genetically high risk for serious complications of smoking, and he successfully kicks the habit.
Motivation: Coalescent based simulation software for genomic sequences allows the efficient in silico generation of short and medium-sized genetic sequences. However, the simulation of genome-size data sets as produced by Next-Generation sequencing (NGS) is currently only possible using fairly crude approximations.
Results: We present the Sequential Coalescent with Recombination Model (SCRM), a new method that efficiently and accurately approximates the coalescent with recombination, closing the gap between current approximations and the exact model. We present an efficient implementation and show that it can simulate genomic-scale data sets with an essentially correct linkage structure.
Availability: The open source implementation scrm is freely available at https://scrm.github.io under the conditions of the GPLv3 license.
We would like to propose sequencing the genome of your bioinformatician Jason Chin for your ‘The Most Interesting Genome in the World’ contest. The only picture of him available online is shown below.
Jason Chin has been writing amazing open-access assembly programs, including HGAP, PBdagcon and, more recently, Falcon. We heard that even your long-read competitor, who pronounced Pacbio dead, uses his programs in their analysis.
Sequencing his genome will help us discover the bioinformatics gene, which we will transfect into the brain cells of every young biology student to make sure they can assemble long, noisy reads. Or we can even do CRISPR/Cas, which seems to be the hottest thing around these days. Going forward, we can even expand this into a larger GWAS project involving those who understand the importance of long reads and those who do not.
As you can see, the possibilities are endless with this small initial investment. Therefore, we earnestly request you to give serious consideration to our proposal.
What is pbdagcon?
pbdagcon is a tool that implements DAGCon (Directed Acyclic Graph Consensus), a sequence consensus algorithm that uses directed acyclic graphs to encode multiple sequence alignments.
It uses the alignment information from blasr to align sequence reads to a “backbone” sequence. Based on the underlying alignment directed acyclic graph (DAG), it will be able to use the new information from the reads to find the discrepancies between the reads and the “backbone” sequences. A dynamic programming process is then applied to the DAG to find the optimum sequence of bases as the consensus. The new consensus can be used as a new backbone sequence to iteratively improve the consensus quality.
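To make the idea concrete, here is a minimal Python sketch of consensus over a directed acyclic graph. It is a toy stand-in, not pbdagcon itself: the real tool builds its graph from blasr alignments to a backbone, while here the input is assumed to be pre-aligned, equal-length rows with ‘-’ for gaps.

```python
from collections import defaultdict

def dag_consensus(rows):
    """Toy DAG consensus: rows are equal-length aligned strings ('-' = gap).
    Nodes are (column, base) pairs; node/edge weights count read support.
    The consensus is the maximum-weight path, found by dynamic programming."""
    node_w = defaultdict(int)   # (col, base) -> number of supporting reads
    edge_w = defaultdict(int)   # ((col1, b1), (col2, b2)) -> support
    for row in rows:
        prev = None
        for col, base in enumerate(row):
            if base == '-':
                continue
            node_w[(col, base)] += 1
            if prev is not None:
                edge_w[(prev, (col, base))] += 1
            prev = (col, base)
    in_edges = defaultdict(list)
    for (src, dst), w in edge_w.items():
        in_edges[dst].append((src, w))
    nodes = sorted(node_w)      # column order = topological order
    best = {n: node_w[n] for n in nodes}   # best path score ending at n
    back = {n: None for n in nodes}
    for n in nodes:
        for src, w in in_edges[n]:
            score = best[src] + w + node_w[n]
            if score > best[n]:
                best[n], back[n] = score, src
    # Backtrack from the highest-scoring node to spell out the consensus.
    n = max(nodes, key=lambda x: best[x])
    out = []
    while n is not None:
        out.append(n[1])
        n = back[n]
    return ''.join(reversed(out))
```

For example, `dag_consensus(["ACGT-", "AC-TT", "ACGTT"])` returns `"ACGTT"`: the majority of reads outvotes the deletion in the second row, which is the whole point of re-deriving the backbone from the graph.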
While the code is developed for processing PacBio(TM) raw sequence data, the algorithm can be used for general consensus purpose. Currently, it only takes FASTA input. For shorter read sequences, one might need to adjust the blasr alignment parameters to get the alignment string properly.
The code and the underlying graphical data structure have been used for some algorithm development prototyping including phasing reads, pre-assembly and a work around to generate consensus from intermediate Celera Assembler outputs.
The initial graphical algorithm was a pure Python implementation. Cython was then used to speed it up.
Check out the example/ directory to see how to use it.
Falcon: a set of tools for fast aligning long reads for consensus and assembly
The Falcon toolkit is a simple collection of code that I use for studying efficient assembly algorithms for haploid and diploid genomes. It has some back-end code implemented in C for speed and a simple front-end written in Python for convenience.
Dan Graur has a new paper describing various functional categories of DNA. It is supposed to rectify the scientific errors made by ENCODE, but that could be hard, given that “it is difficult to get a man to understand something, when his salary depends upon his not understanding it!” Change is impossible as long as NHGRI grant managers like Mike Pazin continue to shower money on their ENCODE friends.
Therefore we propose a revised version of the abstract to classify researchers instead of DNA.
The pronouncements of the ENCODE Project Consortium regarding “junk DNA” exposed the need for a classification of PIs according to their willingness to understand evolution. In the classification scheme presented here, we divide the PIs into “functional PI,” i.e., PIs who understand evolutionary concepts, and “rubbish PI,” i.e., PIs who do not. Functional PI is further subdivided into “literal PI” and “indifferent PI.” A literal PI is further contributing to the development of evolutionary theories; an indifferent PI only uses the concepts in his work. Rubbish PI is further subdivided into “junk PI” and “garbage PI.” A junk PI neither enhances nor misuses evolutionary concepts in his papers and, hence, does not pollute the literature. A garbage PI, on the other hand, decreases the quality of his papers. Papers from garbage PIs exist in the literature only because the reviewers are neither omnipotent nor omniscient. Each of these four PI categories can be (1) funded by NHGRI, (2) funded but not by NHGRI, or (3) not funded. The affiliation of a PI to a particular category may change during his career: a functional PI may become a junk PI, a junk PI may become a garbage PI, a rubbish PI may become a functional PI, and so on; however, determining the evolutionary understanding of a PI must be based on his present status rather than on his potential to change (or not to change) in the future. Changes in classification are divided into pseudoPI, Birney PI, zombie PI, and hidden PI.
Several readers were upset when we called Russia the freest country two years ago (Russia – Best Assembler, Best French Actor and now Freest Country !!). We encourage them to explain the latest oddity.
Greece had a major shift in government after the election last week. The election ended the eurocrat banker occupation. Among the most visible post-election changes, the new government is not afraid of the people any more. Here are two pictures of the main square in Athens, before and after the new government came to power.
In another post-election change, now that Greece is free from occupation, it is leaning toward Russia. That is quite a shift in mood from the older days, when free people used to run west as fast as possible.
Speaking of genome assembly, SPAdes still remains the best assembler, although free people from Hong Kong are catching up :). We will write more on the SPAdes vs. MEGAHIT comparison in a future post.
Accurate base-calling is possibly the step in nanopore sequencing with the most room for improvement. Base-calling from raw electrical signals requires multiple steps – detecting ‘events’ that indicate whether something is passing through the pore, segmenting those events, and finally assigning nucleotide bases to them.
Although the electrical signals from Oxford Nanopore sequencers can be accessed by various early access participants, the company’s fully automated segmentation program remains proprietary. Readers interested in analyzing the raw electrical signals to see whether they can improve the quality of analysis will find a new paper by reader gasstationwithoutpumps useful.
Segmentation of Noisy Signals Generated By a Nanopore
Nanopore-based single-molecule sequencing techniques exploit ionic current steps produced as biomolecules pass through a pore to reconstruct properties of the sequence. A key task in analyzing complex nanopore data is discovering the boundaries between these steps, which has traditionally been done in research labs by hand. We present an automated method of analyzing nanopore data, by detecting regions of ionic current corresponding to the translocation of a biomolecule, and then segmenting the region. The segmenter uses a divide-and-conquer method to recursively discover boundary points, with an implementation that works several times faster than real time and that can handle low-pass filtered signals.
Briefly, here is what they do –
Boundary points are identified by our segmenter one at a time using a recursive algorithm. We start by considering the entire event as a single segment, then consider each possible boundary to break it into two segments. To avoid edge effects and ensure that all segments have at least a minimum duration, only potential boundaries more than the minimum segment length from the ends of the segment are considered.
Each possible boundary is scored using a log-likelihood function (Eq. 1). If the maximal score is above a threshold, the segment is split and the two subsegments are recursively segmented. The recursion terminates either when the segment to split is less than twice the minimum segment length or no score within the segment exceeds the threshold.
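The recursion is easy to sketch in Python. The version below assumes a Gaussian model for each segment and uses a generic variance-based log-likelihood changepoint score; the paper’s actual Eq. 1 and threshold values may differ in detail, so treat this as an illustration of the divide-and-conquer structure rather than a reimplementation.

```python
import numpy as np

def split_score(x, t):
    """Log-likelihood gain from splitting x at index t, modeling each side as
    Gaussian (a standard changepoint score; stands in for the paper's Eq. 1)."""
    var = lambda s: max(float(np.var(s)), 1e-12)   # guard zero variance
    n, n1, n2 = len(x), t, len(x) - t
    return 0.5 * (n * np.log(var(x))
                  - n1 * np.log(var(x[:t]))
                  - n2 * np.log(var(x[t:])))

def segment(x, min_len=5, threshold=10.0, offset=0):
    """Recursively find boundary indices one at a time, mirroring the described
    algorithm: split at the best-scoring boundary if it beats the threshold,
    then recurse on both subsegments."""
    if len(x) < 2 * min_len:                      # too short to split further
        return []
    candidates = range(min_len, len(x) - min_len + 1)
    scores = [split_score(x, t) for t in candidates]
    best = int(np.argmax(scores)) + min_len
    if scores[best - min_len] <= threshold:       # no convincing boundary
        return []
    return (segment(x[:best], min_len, threshold, offset)
            + [offset + best]
            + segment(x[best:], min_len, threshold, offset + best))
```

On a synthetic two-level signal such as `segment(np.concatenate([np.zeros(50), np.full(50, 5.0)]))`, the single boundary at index 50 is found, and the recursion then terminates on the two near-constant subsegments because no further split clears the threshold.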
The code is freely available from this github page, which also includes a lot of information about the algorithm –
This process for detecting and segmenting events in nanopore signals should run in real time; either segmenting a stream of data as it comes in or quickly segmenting an event shortly after its completion. To test the speed of the algorithm, the event detector was implemented in Python and the segmenter was implemented in Cython, a language that allows the efficiency of C within Python. The current implementation is designed to segment full events and is available at the first author’s public github page.
According to NIH Director Francis Collins, NIH is so short of money that it took funds away from an almost-sure Ebola vaccine development effort to support other projects. Three months ago, he claimed that NIH could have found an Ebola vaccine by now, if it had had a little more money.
“NIH has been working on Ebola vaccines since 2001. It’s not like we suddenly woke up and thought, ‘Oh my gosh, we should have something ready here,'” Collins told The Huffington Post on Friday. “Frankly, if we had not gone through our 10-year slide in research support, we probably would have had a vaccine in time for this that would’ve gone through clinical trials and would have been ready.”
It’s not just the production of a vaccine that has been hampered by money shortfalls. Collins also said that some therapeutics to fight Ebola “were on a slower track than would’ve been ideal, or that would have happened if we had been on a stable research support trajectory.”
This month, the same agency announced a new mega-project on million-human-genome sequencing and ‘precision medicine’, which is utterly wasteful, as argued by Professor Ken Weiss in three informative blog posts. Dr. Weiss has been making similar arguments since the late 1990s (see “How many diseases does it take to map a gene with SNPs?”) and has been right so far. The promises made by Francis Collins prior to the human genome sequencing project remain mostly unfulfilled, despite there being no shortage of money and technology. The only thing that came out of the personalized genomics hype is the new name ‘precision medicine’ and more hype.
In light of all this, we wonder whether it is worth having a serious discussion before throwing billions of dollars into a new mega-boondoggle. Has anyone addressed the scientific objections raised by Dr. Weiss? Have we determined why the previous claims made by Francis Collins since the mid-90s remain pipe dreams?
Parts of Dr. Weiss’s blog posts are reproduced below.
Your money at work…er, waste: the million genomes project
Bulletin from the Boondoggle Department
In desperate need for a huge new mega-project to lock up even more NIH funds before the Republicans (or other research projects that are actually focused on a real problem) take them away, or before individual investigators who actually have some scientific ideas to test, we read that Francis Collins has apparently persuaded someone who’s not paying attention to fund the genome sequencing of a million people! Well, why not? First we had the (one) human genome project. Then after a couple of iterations, the 1000 genomes project, then the hundred thousand genomes ‘project’. So, what next? Can’t just go up by dribs and drabs, can we? This is America, after all! So let’s open the bank for a cool million. Dr Collins has, apparently, never met a genome he didn’t like or want to peer into. It’s not lascivious exactly, but the emotion that is felt must be somewhat similar.
We now know enough to know just what we’re (not) getting from all of this sequencing, but what we are getting (or at least some people are getting) is a lot of funds sequestered for a few in-groups or, more dispassionately perhaps, for a belief system, the belief that constitutive genome sequence is the way to conquer every disease known to mankind. Why, this is better than what you get by going to communion every week, because it’ll make you immortal so you don’t have to worry that perhaps there isn’t any heaven to go to after all.
What’s ‘precise’ about ‘precision’ medicine (besides desperate spin)?
The million genomes project
In the same breath, we’re hearing that we’ll be funding a million genomes project. The implication is that if we have a million whole genome sequences, we will have ‘precision medicine’ (personalized, too!). But is that a serious claim or is it a laugh?
A million is a large number, but if most variation in gene-based risk is due, as mountains of evidence shows, to countless very rare variants, many of them essentially new, and hordes of them perhaps per person, then even a million genome sequences will not be nearly enough to yield much of what is being promised by the term ‘precision’! We’d need to sequence everybody (I’m sure Dr Collins has that in mind as the next Major Slogan, and I know other countries are talking that way).
Don’t be naive enough to take this for something other than what it really is: (1) a ploy to secure continued funding perpetrated on his Genome Dream, but in the absence of new ideas and the presence of promises any preacher would be proud of, and results that so far clearly belie it; and (2) a way to protect influential NIH clients with major projects that no longer really merit continued protection, but which will be included in this one (3) to guarantee congressional support from our representatives who really don’t know enough to see through it or who simply believe or just want cover for the idea that these sorts of thing (add Defense contracting and NASA mega-projects as other instances) are simply good for local business and sound good to campaign on.
Yes, Francis Collins is born-again with perhaps a simplistic one-cause worldview to go with that. He certainly knows what he’s doing when it comes to marketing based on genetic promises of salvation. This idea is going to be very good for a whole entrenched segment of the research business, because he’s clever enough to say that it will not just be one ‘project’ but is apparently going to have genome sequencing done on an olio of existing projects. Rationales for this sort of ‘project’ are that long-standing, or perhaps long-limping, projects will be salvaged because they can ‘inexpensively’ be added to this new effort. That’s justified because then we don’t have to collect all that valuable data over again.
But if you think about what we already know about genome sequences and their evolution, and about what’s been found with cruder data, from those very projects to be incorporated among others, a million genome sequences will not generate anything like what we usually understand the generic term ‘precision’ to mean. Cruder data? Yes, for example, the kinds of data we have on many of these ongoing studies, based on inheritance, on epidemiological risk assessment, or on other huge genomewide mapping has consistently shown that there is scant new serious information to be found by simply sequencing between mapping-marker sites. The argument that the significance level will raise when we test the actual site doesn’t mean the signal will be strong enough to change the general picture. That picture is that there simply are not major risk factors except, certainly, some rare strong ones hiding in the sequence leaf-litter of rare or functionless variants.
Of course, there will be exceptions, and they’ll be trumpeted to the news media from the mountain top. But they are exceptions, and finding them is not the same as a proper cost-benefit assessment of research priorities. If we have paid for so many mega-GWAS studies to learn something about genomic causation, then we should heed the lessons we ourselves have learned.
Secondly, the data collected or measures taken decades ago in these huge long-term studies are often no longer state of the art, and many people followed for decades are now pushing up daisies, and can’t be followed up.
Thirdly, is the fact that the epidemiological (e.g., lifestyle, environment…) data have clearly been shown largely to yield findings that get reversed by the next study down the pike. That’s the daily news that the latest study has now shown that all previous studies had it wrong: factor X isn’t a risk factor after all. Again, major single-factor causation is elusive already, so just pouring funds on detailed sequencing will mainly be finding reasons for existing programs to buy more gear to milk cows that are already drying up.
Fourth, many if not even most of the major traits whose importance has justified mega-epidemiological longterm follow up studies, have failed to find consistent risk factors to begin with. But for many of the traits, the risk (incidence) has risen faster than the typical response to artificial selection. In that case, if genomic causation were tractably simple, such strong ‘selection’ should reflect those few genes whose variants respond to the changed environmental circumstances. But these are the same traits (obesity, stature, diabetes, autism,…..) for which mapping shows that single, simple genetic causation does not obtain (and, again, that assumes that the environmental risk factors purportedly responsible are even identified, and the yes-no results just mentioned above shows otherwise).
Worse than this, what about the microbiome or the epigenome, that are supposedly so important? Genome sequencing, a convenient way to carry on just as before, simply cannot generally turn miracles in those areas, because they require other kinds of data (and, not available from current sequencing samples nor, of course, from deceased subjects even if we had stored their blood samples).
Somatic mutation: does it cut both ways?
Beware, million genome project!
What has this got to do with the million genome project? An important fact is that SoMu’s are in body tissues but are not part of the constitutive (inherited) genome, as is routinely sampled from, say, a cheek swab or blood sample. The idea underlying the massive attempts at genomewide mapping of complex traits, and the new culpably wasteful ‘million genomes’ project by which NIH is about to fleece the public and ensure that even fewer researchers get grants because the money’s all been soaked up by DNA sequencing, Big Data induction labs, is that we’ll be able to predict disease precisely, from whole genome sequence, that is, from constitutive genome sequence of hordes of people. We discussed this yesterday, perhaps to excess. Increasing sample size, one might reason, will reduce measurement error and make estimates of causation and risk ‘precise’. That is in general a bogus self-promoting ploy, among other reasons because rare variants and measurement and sample errors or issues may not yield a cooperating signal-to-noise ratio.
So I think that the idea of wholesale, mindless genome sequencing will yield some results but far less than is promised and the main really predictable result, indeed precisely predictable result, is more waste thrown onto mega-labs, to keep them in business.
Anyway, we’re pretty consistent with our skepticism, nay, cynicism about such Big Data fads as mainly grabs in tight times for funding that’s too long-lasting or too big to kill, regardless of whether it’s generating anything really useful.
A few days ago, we mentioned the unfortunate experience of Adam Eyre-Walker, a well-respected evolutionary biologist, who was asked to provide bank statements to show that he was indeed poor enough to get a PLoS fee waiver.
Poor People Need to Provide Bank Statements to Publish in PLoS
Thanks to the social media storm and Mike Eisen’s efforts, that PLoS policy seems to have been changed for good (or at least until the next social media storm, about Bill Gates trying to get a fee waiver at PLoS, reverses it). PLoS published the following letter on their webpage and explained that the request to Adam Eyre-Walker was an error.
PLOS Clarifies its Publication Fee Assistance Policy
PLOS would like to clarify the policy by which authors can apply for fee assistance in the form of a partial or full fee waiver. Authors who are unable to obtain financial support from institutional, library, government agencies or research funders to pay for publication are not expected to self-fund these costs. In short, PLOS does not expect authors to fund publication fees through their personal funds.
Based on a misinterpretation of the organization’s Publication Fee Assistance (PFA) policy, requests were made or implied for individual financial information from certain PFA applicants. This was done in error. We regret any confusion it may have caused for applicants and any other members of the community. The process for communicating with PFA applicants and the language used on relevant PLOS application forms have now been corrected.
PLOS is committed to ensuring that the availability of research funding is not a barrier to publishing scientific research. Our Global Participation Initiative covers the full cost of publishing for authors from low-income countries, and covers most of the cost for authors from middle-income countries. Our PFA program has always and continues to support those with demonstrated need who are unable to pay all or part of their publication fees.
We wrote about the Meraculous assembler over two years ago (see Genome Assembly – MERmaid and Meraculous), and even then it was noteworthy for using a perfect hash data structure to store the graph. Reader J. Zola pointed out that the program has improved significantly since then.
You should also point out that in SC 2014 Aydin Buluc et al. published what is probably the most scalable parallel version of de Bruijn graph construction. The algorithm has been designed for and incorporated into Meraculous. More here: http://dl.acm.org/citation.cfm?id=2683642.
Readers can access the paper on scalable parallel dBG construction here. The performance improvement from days to seconds is very impressive!
De novo whole genome assembly reconstructs genomic sequence from short, overlapping, and potentially erroneous fragments called reads. We study optimized parallelization of the most time-consuming phases of Meraculous, a state-of-the-art production assembler. First, we present a new parallel algorithm for k-mer analysis, characterized by intensive communication and I/O requirements, and reduce the memory requirements by 6.93×. Second, we efficiently parallelize de Bruijn graph construction and traversal, which necessitates a distributed hash table and is a key component of most de novo assemblers. We provide a novel algorithm that leverages one-sided communication capabilities of the Unified Parallel C (UPC) to facilitate the requisite fine-grained parallelism and avoidance of data hazards, while analytically proving its scalability properties. Overall results show unprecedented performance and efficient scaling on up to 15,360 cores of a Cray XC30, on human genome as well as the challenging wheat genome, with performance improvement from days to seconds.
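As a reminder of what the paper parallelizes, here is a serial toy version of de Bruijn graph construction and unambiguous-path traversal in Python. It only sketches the data structure (a hash table from (k-1)-mer nodes to successor nodes), assuming error-free reads; the paper’s contribution is doing this with a distributed hash table and one-sided UPC communication at massive scale.

```python
from collections import defaultdict

def build_dbg(reads, k):
    """Map each (k-1)-mer node to the set of successor nodes (edges = k-mers)."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Traverse while the path is unambiguous (exactly one out-edge),
    appending one base per step to build a contig."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:           # stop on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig
```

For instance, `walk_unambiguous(build_dbg(["ACGTG", "GTGGA"], 3), "AC")` stitches the two overlapping reads into the contig `"ACGTGGA"`. Real assemblers additionally filter erroneous k-mers and handle branches, which is where most of the engineering effort goes.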
We posted about a number of publications from David Tse’s group investigating fundamental limits for assembly algorithms (see here and here). In their latest paper, they look at how noise in the reads affects those fundamental limits.
While most current high-throughput DNA sequencing technologies generate short reads with low error rates, emerging sequencing technologies generate long reads with high error rates. A basic question of interest is the tradeoff between read length and error rate in terms of the information needed for the perfect assembly of the genome. Using an adversarial erasure error model, we make progress on this problem by establishing a critical read length, as a function of the genome and the error rate, above which perfect assembly is guaranteed. For several real genomes, including those from the GAGE dataset, we verify that this critical read length is not significantly greater than the read length required for perfect assembly from reads without errors.
It should be obvious that if the reads are 100% noisy, no assembly is possible no matter how long the reads are. What, then, is the ‘fundamental limit’ for the error rate in the reads?
The answer depends on the distribution of errors as explained in the paper.
Our results show that for several actual genomes, if we are in a dense-read model with reads 20-40% longer than the noiseless requirement ℓcrit(s), perfect assembly feasibility is robust to erasures at a rate of about 10%. While this is not as optimistic as the message from , we emphasize that we consider an adversarial error model. When errors instead occur at random locations, it is natural to expect less stringent requirements.