I am starting to read Nick Lane’s new book – “The Vital Question: Energy, Evolution, and the Origins of Complex Life”.
Nick Lane is a UK-based biochemist, who writes about the unsolved fundamental problems in evolutionary biology. I earlier read his ‘Life Ascending: The Ten Great Inventions of Evolution’ and enjoyed it. Here is a short description of the new book.
To explain the mystery of how life evolved on Earth, Nick Lane explores the deep link between energy and genes.
The Earth teems with life: in its oceans, forests, skies and cities. Yet there’s a black hole at the heart of biology. We do not know why complex life is the way it is, or, for that matter, how life first began. In The Vital Question, award-winning author and biochemist Nick Lane radically reframes evolutionary history, putting forward a solution to conundrums that have puzzled generations of scientists.
For two and a half billion years, from the very origins of life, single-celled organisms such as bacteria evolved without changing their basic form. Then, on just one occasion in four billion years, they made the jump to complexity. All complex life, from mushrooms to man, shares puzzling features, such as sex, which are unknown in bacteria. How and why did this radical transformation happen?
The answer, Lane argues, lies in energy: all life on Earth lives off a voltage with the strength of a lightning bolt. Building on the pillars of evolutionary theory, Lane’s hypothesis draws on cutting-edge research into the link between energy and cell biology, in order to deliver a compelling account of evolution from the very origins of life to the emergence of multicellular organisms, while offering deep insights into our own lives and deaths.
Both rigorous and enchanting, The Vital Question provides a solution to life’s vital question: why are we as we are, and indeed, why are we here at all?
Lane and Martin first proposed the above ideas in their 2010 Nature paper – The energetics of genome complexity.
Gene Myers kindly shared his slides with our blog readers. They should help in understanding the notes from yesterday’s talk.
A PacBio bioinformatics workshop is currently going on at NIST in Maryland, where Gene Myers (who is not as smart as David Haussler) is the keynote speaker. Interested readers are encouraged to follow the SMRTBFX hashtag on Twitter. The slides from the talk are posted here.
Based on twitter reports, Gene Myers said –
1. The noise in PacBio reads is basic thermodynamic noise -> almost totally random.
2. Perfect assembly is possible iff: 1) Poisson sampling, 2) random error, 3) reads longer than repeats.
3. Longer reads take away some of the fun of solving the repeat problem.
4. The repeats can be resolved by long reads and by leveraging the heterogeneity of repeats. No assembler has reached the theoretical limits yet.
5. The future is here – right out of the box, reference genomes are now possible.
6. It is easier to share data interfaces than software interfaces. Let’s define interfaces so we can play together — good idea, Gene! The same principle guided the software group at Celera.
7. GM talked about the “classic time-space trade-off” for sequence alignment. Using trace points saves time with minimal space overhead.
8. Gene Myers’ Dazzler blog has all the code.
9. Trace points scale linearly in both time and space.
10. The trace-point concept is better than BAM and SAM for PacBio reads. We would need converters to BAM and SAM for other tools.
11. No evidence aligning to A implies A is bad. Vote for the quality of each A segment.
12. It is a statistic to vote on this, and it gives information about what the quality is.
13. Votes of consensus between B reads (aligned reads) and the A read (anchor read) can be used for intrinsic QVs.
14. Perfect (or near-perfect) assembly is within reach. Dazzler DB framework for the assembly pipeline; trace points for intrinsic quality values.
15. All data needed for the assembler is now stored in 2 TB of data.
16. Uses patching to create uniform-quality reads and fix the artifacts to get a good string graph.
17. Scrubbers should remove as little real data as possible while removing all the artifacts from the data.
18. No real-world string graphs look perfect, because of insufficient scrubbing.
19. A challenge for DAscrub is incorporating repeat analysis. Available for collaborative use; not stable enough for wide distribution.
20. DAscrub: assemble @PacBio data without an error-correction step. Coming soon.
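The trace-point idea mentioned in the notes above can be sketched in a few lines. This is a simplification of what the Dazzler blog describes, with illustrative names and encoding (not Dazzler’s actual code): along read A, store one (difference count, B-bases consumed) pair per fixed-size panel, instead of a full edit script. Any panel’s exact alignment can then be recomputed on demand with a cheap banded alignment bounded by the stored difference count.

```python
def trace_points(ops, spacing=100):
    """Compress an alignment into trace points (illustrative sketch).

    ops: alignment operations along read A —
      'M' match (advances A and B), 'X' mismatch (advances A and B),
      'I' insertion in B (advances B only), 'D' deletion (advances A only).
    Returns one (diff_count, b_bases_consumed) pair per spacing-sized
    panel of read A.
    """
    points = []
    a_in_panel = diffs = b_used = 0
    for op in ops:
        if op in ('M', 'X', 'D'):   # consumes a base of read A
            a_in_panel += 1
        if op in ('M', 'X', 'I'):   # consumes a base of read B
            b_used += 1
        if op in ('X', 'I', 'D'):   # counts as a difference
            diffs += 1
        if a_in_panel == spacing:   # panel boundary reached on A
            points.append((diffs, b_used))
            a_in_panel = diffs = b_used = 0
    if a_in_panel:                  # final partial panel
        points.append((diffs, b_used))
    return points
```

The space cost is two small integers per panel instead of a per-base edit script, which is why this scales so much better than storing full CIGAR strings for noisy long reads.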
…….and possibly a snub at the doctor, who is currently running NIH and was his competitor in the Human Genome Project.
Thanks to @infoecho and others for covering the workshop for us.
Readers may find this paper by Bryan Howie and colleagues interesting. It solves an important problem (finding pairing pattern of TCR alpha and beta chains) using high-throughput sequencing and an elegant mathematical model.
The T cell receptor (TCR) protein is a heterodimer composed of an alpha chain and a beta chain. TCR genes undergo somatic DNA rearrangements to generate the diversity of T cell binding specificities needed for effective immunity. Recently, high-throughput immunosequencing methods have been developed to profile the TCR alpha (TCRA) and TCR beta (TCRB) repertoires. However, these methods cannot determine which TCRA and TCRB chains combine to form a specific TCR, which is essential for many functional and therapeutic applications. We describe and validate a method called pairSEQ, which can leverage the diversity of TCR sequences to accurately pair hundreds of thousands of TCRA and TCRB sequences in a single experiment. Our TCR pairing method uses standard laboratory consumables and equipment without the need for single-cell technologies. We show that pairSEQ can be applied to T cells from both blood and solid tissues, such as tumors.
Their probabilistic model is very familiar to me, because twelve years back I used it in a different context – for finding clustered proteins from noisy large-scale protein-protein interaction data. Here is the basic idea. If A has 10 friends and B has 10 friends, what is the probability that they have 9 common friends (by chance alone)? If the computed probability is very low and the actual measurement shows that they indeed have 9 common friends, that means A and B are strongly associated. In case of TCR alpha and beta chains in their experimental set-up, that strong association implies their combining to form heterodimers. For large-scale protein-protein interaction data in yeast, significant association appeared to show functional similarity.
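The “common friends” computation above is a hypergeometric tail probability, and it fits in a few lines. This is a sketch of the idea, not the paper’s actual model, and all numbers are illustrative: a population of 1000 wells, A present in 10, B present in 10, and an observed overlap of 9.

```python
from math import comb

def cooccurrence_pvalue(N, a, b, k):
    """P(X >= k) for X ~ Hypergeometric(N, a, b): of N wells, a contain
    A; draw the b wells containing B; X counts the A/B overlap expected
    by chance alone."""
    total = comb(N, b)
    return sum(comb(a, x) * comb(N - a, b - x)
               for x in range(k, min(a, b) + 1)) / total

# Illustrative numbers: an overlap of 9 out of 10 in 1000 wells is
# essentially impossible by chance, so A and B are strongly associated.
p = cooccurrence_pvalue(1000, 10, 10, 9)
```

When this p-value is tiny but the overlap is actually observed, chance is ruled out, and the association must have a physical cause: heterodimer pairing for TCR chains, functional similarity for the yeast proteins.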
The work for the above TCR paper was done by Adaptive Biotechnologies, a very creative Seattle-based company that was founded by scientists from the Fred Hutchinson Cancer Research Center.
I finished reading a wonderful book that I recommend to everyone trying to understand evolution. It was written by paleontologist David M. Raup, who passed away last month. My curiosity was piqued by the Sandwalk blog, which recommended it as one of the top five books on evolution.
Raup argued that, given the high rate (99.9%) at which species that ever lived on earth have died out (gone extinct), the mode of extinction should play a dominant role in evolution. But what if a large fraction of those extinctions took place due to rare natural catastrophes (such as rocks hitting the earth from the sky)? Shouldn’t biologists then pay less attention to the fitness of genes and more to unpredictable natural events? For a quick introduction to the topic, readers may take a look at Raup’s 1994 PNAS paper – “The role of extinction in evolution”.
The extinction of species is not normally considered an important element of neodarwinian theory, in contrast to the opposite phenomenon, speciation. This is surprising in view of the special importance Darwin attached to extinction, and because the number of species extinctions in the history of life is almost the same as the number of originations; present-day biodiversity is the result of a trivial surplus of originations, cumulated over millions of years. For an evolutionary biologist to ignore extinction is probably as foolhardy as for a demographer to ignore mortality. The past decade has seen a resurgence of interest in extinction, yet research on the topic is still at a reconnaissance level, and our present understanding of its role in evolution is weak. Despite uncertainties, extinction probably contains three important elements. (i) For geographically widespread species, extinction is likely only if the killing stress is one so rare as to be beyond the experience of the species, and thus outside the reach of natural selection. (ii) The largest mass extinctions produce major restructuring of the biosphere wherein some successful groups are eliminated, allowing previously minor groups to expand and diversify. (iii) Except for a few cases, there is little evidence that extinction is selective in the positive sense argued by Darwin. It has generally been impossible to predict, before the fact, which species will be victims of an extinction event.
The promoters of ‘Big Data’ science argue that by collecting increasingly large amount of data and by processing the data with clever algorithms, they can make fundamental scientific discoveries (or other social contributions). Many others point out the lack of discoveries compared to what the same people had been promising for years, to which the ‘Big Data’ supporters say that they have not collected enough data yet.
In this article, I present three rules to show that the basic premise of ‘Big Data’ is faulty and explain them with many examples. The rules are:
1. Quality does not scale, but noise scales.
2. The noise can only be reduced by high-quality algorithms.
3. Rules 1 and 2 are valid at all scales.
Let me start with the simple example of genome assembly from short reads using de Bruijn graphs. What happens when one throws in tons and tons of reads to get higher coverage? The number of high-quality k-mers (i.e. those truly matching the genome) remains the same, but the number of noisy k-mers scales with more data. As you add more coverage, the noisy k-mers start to overwhelm the system. The first thing you see is your computer’s RAM filling up, leading to periodic crashes. At even higher coverage, the de Bruijn graph has all kinds of tips and bubbles formed in addition to the real contigs.
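The effect is easy to reproduce with a toy simulation (everything here is illustrative — a random 10 kb “genome”, uniform reads, uniform 1% error — not real data): distinct genomic k-mers saturate at the genome’s k-mer content, while distinct noisy k-mers keep growing with coverage.

```python
import random

random.seed(0)
GENOME = ''.join(random.choice('ACGT') for _ in range(10_000))
K, READ_LEN, ERR = 21, 100, 0.01

# The set of k-mers that truly occur in the genome.
true_kmers = {GENOME[i:i + K] for i in range(len(GENOME) - K + 1)}

def sequence_read():
    """Sample a read uniformly from the genome, replacing each base
    with a random nucleotide with probability ERR."""
    start = random.randrange(len(GENOME) - READ_LEN + 1)
    return ''.join(random.choice('ACGT') if random.random() < ERR else c
                   for c in GENOME[start:start + READ_LEN])

def distinct_kmers(coverage):
    """Return (#distinct genomic k-mers, #distinct noisy k-mers)
    observed at the given coverage."""
    kmers = set()
    for _ in range(coverage * len(GENOME) // READ_LEN):
        read = sequence_read()
        kmers.update(read[i:i + K] for i in range(len(read) - K + 1))
    return len(kmers & true_kmers), len(kmers - true_kmers)

g10, n10 = distinct_kmers(10)     # 10x coverage
g100, n100 = distinct_kmers(100)  # 100x coverage
# Genomic k-mers barely move between 10x and 100x; noisy k-mers explode.
```

Since the space of possible 21-mers (4^21) is vast, almost every sequencing error creates k-mers never seen before, which is exactly why the noise scales while the signal does not.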
Both problems are solvable, but they need high-quality algorithms.
Think carefully about the implication of the last sentence within the context of Rule 3. Rule 3 says that Rules 1 and 2 apply at all scales. That means they apply to the de Bruijn graph, but they also apply to researchers developing high-quality algorithms.
Let us say the government throws in a lot of money to get high-quality algorithms. What happens? Well, high-quality algorithms reach a plateau, but noise scales with money. As a result, the space of new bioinformatics tools looks similar to a de Bruijn graph at 200x coverage. How do you figure out what is good and what is not? Well, maybe you need high-quality algorithms to figure out which algorithms are worth using.
I thought about these rules for months and could not find any way out of the constraints they impose. Meanwhile, the scientific world keeps marching to the tune of ‘Big Data’ in all aspects. Every aspect of it, including ranking papers based on citations (or, God forbid, retweets), is vulnerable. The same goes for automated annotation of public databases based on existing data. This last point will be the focus of a forthcoming post.
Stefano Lonardi posted a link to his paper, and, believe it or not, I was looking for the same paper in my directory while writing this blog post today, but could not locate it. That is no coincidence, because my initial thinking six months back was influenced by his paper posted at bioRxiv. At that time, I was working on the 1000x E. coli data posted by the SPAdes group and made a similar observation in the assembly stats. The explanation (the dBG picking up more noise) seemed obvious, but it is also known that SPAdes manages to produce a good assembly from the 1000x data. That observation inspired the rest of the thinking about the need for high-quality algorithms to overcome noise.
You may also realize that throwing more money at the assembly problem would not have produced the high-quality solution, but would instead have polluted the space with too much noise (i.e. low-quality algorithms). It was rather Pevzner’s work over two decades that got us there. That is the essence of Rule 3 in one human context.
If the software license is the only thing that stops you from using the wonderful Kallisto algorithm/program, maybe this GitHub code can help. As another advantage, it comes with a GPL license (it could be BSD if not for the Jellyfish dependency), and you can build your code on top of it by using RapMap as a module. Pseudoalignment is a powerful lightweight concept, and we can expect more applications to use this module.
What is RapMap?
RapMap is a testing ground for ideas in lightweight / pseudo / quasi transcriptome alignment. That means that, at this point, it is very experimental and there are no guarantees on stability / compatibility between commits. Eventually, I hope that RapMap will become a stand-alone lightweight / pseudo / quasi-aligner that can be used with other tools.
Lightweight / pseudo / quasi-alignment is the term I’m using here for the type of information required for certain tasks (e.g. transcript quantification) that is less “heavyweight” than what is provided by traditional alignment. For example, one may only need to know the transcripts / contigs to which a read aligns and, perhaps, the position within those transcripts rather than the optimal alignment and base-for-base CIGAR string that aligns the read and substring of the transcript.
There are a number of different ways to collect such information, and the idea of RapMap (as the repository grows) will be to explore multiple different strategies in how to most rapidly determine all feasible / compatible locations for a read within the transcriptome. In this sense, it is like an all-mapper; the alignments it outputs are intended to be (eventually) disambiguated (Really, it’s more like an “all-best” mapper, since it returns all hits in the top “stratum” of lightweight/pseudo/quasi alignments). If there is a need for it, best-mapper functionality may be added in the future.
How fast is RapMap?
It’s currently too early in development for a comprehensive benchmark suite, but, on a synthetic test dataset comprised of 75 million 76bp paired-end reads, mapping to a human transcriptome with ~213,000 transcripts, RapMap takes ~10 minutes to align all of the reads on a single core (on an Intel Xeon E5-2690 @ 3.00 GHz). If you actually want to write out the alignments, it depends on your disk speed, but for us it’s ~15 minutes. Again, these mapping times are on a single core, and before any significant optimizations (the code is only about a week and a half old) — but RapMap is trivially parallelizable and can already be run with multiple threads.
Some more details from Rob –
Genome centers publish genome papers in glam journals and then move on to more genome assemblies and more glam papers. Meanwhile, researchers trying to do biology using the published genomes are stuck with defective assemblies and noise-prone downstream analysis. The biggest source of noise comes from ‘clean’ Illumina reads. Short reads are noisy by virtue of being short, and that noise manifests as incorrect scaffolding in genome assembly. Unless researchers recognize this problem and actively work on it, instead of generating more and more (1K, 10K, etc.) ‘genome assemblies’ and genome papers, we will be stuck with a massive amount of erroneous genome assemblies.
The Pacific oyster Crassostrea gigas, a widely cultivated marine bivalve mollusc, is becoming a genetically and genomically enabled model for highly fecund marine metazoans with complex life-histories. A genome sequence is available for the Pacific oyster, as are first-generation, low density, linkage and gene-centromere maps mostly constructed from microsatellite DNA markers. Here, higher density, second-generation, linkage maps are constructed from more than 1100 coding (exonic) single-nucleotide polymorphisms (SNPs), as well as 66 previously mapped microsatellite DNA markers, all typed in five families of Pacific oysters (nearly 172,000 genotypes). The map comprises 10 linkage groups, as expected, has an average total length of 588 centiMorgans (cM), an average marker-spacing of 1.0 cM, and covers 86% of a genome estimated to be 616 cM. All but seven of the mapped SNPs map to 618 genome scaffolds; 260 scaffolds contain two or more mapped SNPs, but for 100 of these scaffolds (38.5%), the contained SNPs map to different linkage groups, suggesting widespread errors in scaffold assemblies. The 100 misassembled scaffolds are significantly longer than those that map to a single linkage group. On the genetic maps, marker orders and inter-marker distances vary across families and mapping methods, owing to an abundance of markers segregating from only one parent, to widespread distortions of segregation ratios caused by early mortality, as previously observed for oysters, and to genotyping errors. Maps made from framework markers provide stronger support for marker orders and reasonable map lengths and are used to produce a consensus high-density linkage map containing 656 markers.
In the thread on valuation of Oxford Nanopore, reader George spiggot commented –
Samanta, that’s not how valuations work. If they sold 5000 sequencers, and people used 2 flow cells a month, that’s 5-10 million a month, or 60-70M per year. Then you have the price-earnings ratio; have a look at Illumina’s. Their value is many times earnings. In fact, today, based on a 1% earnings miss, they lost 3x ONT’s value. ILMN makes about $1B per year but is worth $33B. Why is that? Investors are not paid back from revenue; it’s not a payday loan. As for PacBio, their value is low, and stays low, because it is hard to see how their revenue can scale. With ONT you can see how it scales and moves into new markets.
George, thank you for the comment, because this is a perfect time to discuss valuation. As you noticed, Illumina’s market valuation got whacked by over $2B for no apparent good reason. Why?
Please do not consider the following discussion as financial advice, but here is how valuation works.
There is one and only one way to properly value a business.
1. You first figure out what you can earn risk-free for your money. That answer is 3% at present.
2. Then you determine what the business returns as dividend. Is it sufficiently more than 3% after considering various risks? There are all kinds of risks in running a business, ranging from simple hacker attack (e.g. Ashley Madison) to explosions in chemistry lab, Chinese stock market crash and FDA not liking your business plan. A person investing in a business needs to account for all those risks and factor in sufficient premium over 3%. Investing in a more risky business requires higher premium.
Question: Tech companies do not pay dividend and reinvest profits in growth. Are they valued at zero?
If you do not get any dividend, you may substitute the profit margin of a rapidly growing company for the dividend, provided you have sufficient expectation of getting the money returned to you as dividend in the future. Also remember that this act of substitution adds another risk factor, because a bird in the hand is worth two in the bush. For all you know, the CEO may ultimately hike his salary and walk away with the profit.
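The two steps above amount to discounting a dividend stream at a required return (risk-free rate plus a risk premium). A minimal sketch using the standard Gordon growth model — all numbers here are illustrative and are not a valuation of any company mentioned in this post:

```python
def gordon_value(dividend, risk_free, risk_premium, growth=0.0):
    """Fair value of a perpetual dividend stream discounted at the
    required return (risk-free rate + premium for the business's risks).
    Only valid when growth < required return."""
    required = risk_free + risk_premium
    if growth >= required:
        raise ValueError("growth must be below the required return")
    return dividend / (required - growth)

# Illustrative: $1/share dividend, 3% risk-free rate, 5% risk premium,
# 2% perpetual growth -> 1 / (0.08 - 0.02) ~= $16.67 per share.
fair_value = gordon_value(1.0, 0.03, 0.05, growth=0.02)
```

Note how the formula encodes the risk argument directly: a riskier business (higher premium) has a higher required return and therefore a lower fair value for the same dividend.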
At the opposite pole from honest valuation sits ‘Ponzi valuation’, or ‘get rich quick’ valuation. It is based on how fast one can get rich by passing the hot potato to the next wannabe genius. No other explanation is needed.
Ponzi valuation relies on myth-building. You see an abundance of visionaries trying to predict the rosy future and justify the high stock prices at the peak of a boom. They mock anyone not agreeing with them as ‘inexperienced’.
When a person with money meets a person with experience, the person with the experience winds up with the money and the person with the money winds up with the experience.
Ponzi valuations rule under two conditions – (i) high debt, and (ii) a large amount of opium in the market. People playing with other people’s money (OPM), such as managers of mutual/pension funds, compare companies against companies rather than looking at absolute returns. That is because no pension/mutual fund manager lost their job by following what the others are doing, but many were fired for trying to be safe. It is even better for their careers when they can borrow to increase leverage and short-term performance.
Please note that our use of ‘Ponzi’ comes from Hyman Minsky (check Minsky’s financial instability hypothesis).
Which valuation is right?
Rather than right or wrong, the more pertinent question would be ‘which valuation is risky’? Risky behavior may sometimes appear exciting, and people get carried away without properly accounting for the underlying risks.
In the context of ONT, your comment makes two assumptions justifying the $1.5B valuation – (i) ILMN’s valuation, and (ii) that ONT is as good as ILMN. How valid are those assumptions, and what other risks come with them?
i) Current valuation of ILMN – The market valuation of ILMN, based on which you justified the other numbers, appears to be mostly Ponzi valuation. There had been huge pent-up demand for sequencing in 2009-2012, giving an impression of exponential growth. As of 2015, sequencing is not the biggest bottleneck; bioinformatics is. Even within bioinformatics, the first hurdle (storage and preliminary analysis of large amounts of data) is not the largest constraint. The real question is how to make biologically/medically relevant discoveries from that data, given that many of the low-hanging fruits are gone. That difficult task requires more than sequencing.
ii) ‘Too much sequencing’ factor – Let’s assume that ONT becomes as successful as ILMN. In that case, the saturating demand for more sequencing will be satisfied by ILMN, PacBio, ONT and BGI with their new instrument, just to mention a few. They will eat each other’s margins and reduce ILMN’s current ‘70%’ profit margin.
iii) How good is ONT? – Speaking of ONT itself, I have been watching them for years, and they have always under-delivered, whereas the scientists surrounding them (e.g. Mick Watson, Birney) hyped up the prospects without giving a true picture of reality. Jared Simpson is the only brain around there, and his work on E. coli assembly from an HMM of electrical signals is commendable. However, I do not see Mick Watson’s ‘prediction’ of a human genome assembly from ONT this year following from that success in E. coli.
Hyping up by distorting numbers creates obstacles for anyone trying to design efficient programs. Take the example of error characterization. Those numbers are very important for designing any bioinformatics program, and in the case of PacBio, Mark Chaisson and Glenn Tesler wrote a fantastic first paper providing an accurate description of the error profile. When it comes to ONT data, I am completely lost. Are the errors random, or do they have positional bias? Do the errors have AT/GC bias or homopolymer bias? What is the relative distribution of substitutions and indels?
iv) Factors not under ONT’s control – Irrespective of ONT’s performance, what happens if ILMN’s market cap falls to 2-3x revenue due to a stock-market correction? Will ONT be able to raise more funds? If ONT cannot raise money, it will have to survive on its revenue/profit and by burning existing cash. That is another big risk factor going forward that is not currently taken into consideration.
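For what it is worth, the kind of error characterization asked for in point (iii) can be computed directly from alignments. A minimal sketch that tallies mismatch/insertion/deletion fractions from an extended CIGAR string (‘=’ match, ‘X’ mismatch, ‘I’ insertion, ‘D’ deletion, as in the SAM spec); the example string and its numbers are made up, not real ONT data:

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r'(\d+)([=XID])')

def error_profile(cigar):
    """Return the fraction of aligned bases that are mismatches ('X'),
    insertions ('I'), and deletions ('D') in an extended CIGAR string."""
    counts = Counter()
    for length, op in CIGAR_RE.findall(cigar):
        counts[op] += int(length)
    aligned = sum(counts[op] for op in '=XID')
    return {op: counts[op] / aligned for op in 'XID'} if aligned else {}

# Made-up example: 90 matches, 2 mismatches, 5 inserted, 2 deleted bases.
profile = error_profile('30=1X20=2I10=1X30=3I2D')
```

Run over a large set of reads, binned by read position or by surrounding sequence context, the same tally answers exactly the questions above: positional bias, AT/GC and homopolymer bias, and the substitution/indel ratio.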
How about stock market valuation?
Stock market valuation goes from ‘honest valuation’ at the bottom of a bust to ‘Ponzi valuation’ at the peak of a boom. That is essentially what Minsky described in his instability hypothesis. Moreover, you can also apply this rule in reverse and count the number of visionaries predicting a rosy future to figure out where in the business cycle we currently are.
Here is a good example of the full cycle from Ponzi to honest valuation.
Sycamore Networks: From $45 Billion to Zilch
By SCOTT THURM and BEN FOX RUBIN
Updated Feb. 1, 2013 5:38 p.m. ET
There was a time when Sycamore Networks Inc. was the next big thing—a leader in the race to direct digital traffic across the Internet.
This is not that time. On Friday, Sycamore all but went out of business. The Chelmsford, Mass., company said it had completed the sale of its remaining product line and that its shareholders had voted to dissolve the company. Sycamore ended the day with a market value of about $66 million, a humbling end for a company that in March 2000 was worth nearly $45 billion.
Twitter is buzzing with a photo posted by @san_kaido –
They are deformed daisies near Fukushima, but is the deformation caused by radiation? Fasciation is an alternative explanation.
Fasciation (or cresting) is a relatively rare condition of abnormal growth in vascular plants in which the apical meristem (growing tip), which normally is concentrated around a single point and produces approximately cylindrical tissue, instead becomes elongated perpendicularly to the direction of growth, thus, producing flattened, ribbon-like, crested, or elaborately contorted tissue. Fasciation may also cause plant parts to increase in weight and volume in some instances. The phenomenon may occur in the stem, root, fruit, or flower head. Some plants are grown and prized aesthetically for their development of fasciation. Any occurrence of fasciation has several possible causes, including hormonal, genetic, bacterial, fungal, viral and environmental causes.
The answer is not either/or, because fasciation can also be caused by genetic mutation from radiation. Speaking of radiation, readers may also find the following post from 2012 relevant –
Severe Abnormalities Found in Fukushima Butterflies