I came across an article titled ‘A visionary, a genius, and the human genome’ and checked what the hoopla was all about. What I found was quite shocking.
It was May 2000, the race to sequence the human genome was on, and UC Santa Cruz Biomolecular Engineering Professor David Haussler was worried.
A private firm named Celera Genomics was beating a path to the prize with a big budget and what was reported to be the most powerful computer cluster in civilian use.
Meanwhile, an international consortium of scientists—which Haussler had only recently been invited to join—was lagging behind.
Haussler, a tall man with a penchant for Hawaiian shirts, had managed to wrangle 100 Dell Pentium III processor workstations for the project. About 30 had been purchased but others had been intended for student use—until Dean of Engineering Patrick Mantey and Chancellor M.R.C. Greenwood agreed Haussler could “break in” the machines before they went into classrooms.
Each of the computers was less powerful than one of today’s smart phones, but, nevertheless, the UC Santa Cruz group was able to link them together for parallel processing, creating a makeshift “supercomputer” for the project.
What happened next was the stuff of movies: a genius and a brilliant computer scientist at an upstart university defying the odds to become the first in the world to assemble the DNA pieces of the human genome.
Why does UCSC bring up the human genome project after 15 years and write revisionist history? Because ‘Genomics Institute director David Haussler awarded prestigious Dan David Prize’.
Joseph Klafter, president of Tel Aviv University and chair of the Dan David Prize Board, said in a video announcement of the prize laureates, “Professor David Haussler presented the first draft of the human genome sequence and developed the Genome Browser used worldwide for interpreting genome sequences. The browser includes tools for identifying and comparing genes, for accessing information on gene structure, function, and regulation, and for revealing gene-disease relationships. Dr. Haussler introduced machine learning techniques to bioinformatics, becoming a central paradigm in the field.”
Based on the literature, Haussler did not write a genome assembler at any time before May 2000, nor did he contribute to the genome assembly field after July 2000. Gene Myers (an author of BLAST and a co-inventor of suffix arrays) and his student Kececioglu developed the shotgun assembly algorithm in 1991 (Kececioglu’s thesis) and published it in many papers thereafter. When Myers wrote a paper in 1997 explaining that the same algorithm could be used for human genome assembly, the clowns from the government-backed human genome project protested. That was the main reason for Celera’s existence. We covered all of that history in the following two blog posts.
An Opinionated History of Genome Assembly Algorithms – (i)
An Opinionated History of Genome Assembly Algorithms – (ii)
At Celera, Myers developed an algorithm that has been used for countless assemblies since then (including Drosophila, human and mouse). Even today, Jason Chin of Pacbio uses the Celera assembler for assembling long reads. Myers published a review paper in April 2000 describing the software, alongside the Drosophila genome paper that came out in the same issue.
Anyone who has done a genome assembly and read through those early papers clearly understands that Haussler’s claim of ‘doing the first assembly of the human genome’ is utterly deceptive. The government-backed biologists kept repeating the claim to cover their own backs, but Haussler should know better. Therefore, if he accepts an award for ‘doing the first assembly of the human genome’ with a straight face, he is also due for a Manolis Kellis award for integrity sooner or later.
The core algorithm
A naive brute-force solution to find the optimal seeds of a read would be to systematically iterate through all possible combinations of seeds. We start by selecting the first seed, instantiating all possible positions and lengths of that seed. On top of each position and length of the first seed, we instantiate all possible positions and lengths of the second seed, which is sampled after (to the right-hand side of) the first seed. We repeat this process for the remaining seeds until all seeds have been sampled. For each combination of seeds, we calculate the total seed frequency and find the minimum total seed frequency among all combinations.
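The brute-force enumeration above can be sketched in a few lines. This is a toy illustration, not the paper’s implementation; `freq()` is a hypothetical helper that would look up a substring’s occurrence count in the reference index, and the recursion simply tries every start and end for each seed in left-to-right order.

```python
def brute_force_seeds(read, freq, x, start=0):
    """Exponential-time enumeration of every placement of x
    non-overlapping seeds (contiguous substrings, left to right)
    in read[start:], returning the minimum total seed frequency.
    For illustration only; freq() is a hypothetical lookup."""
    L = len(read)
    if x == 0:
        return 0
    best = float("inf")
    for s in range(start, L):          # start of the current seed
        for e in range(s + 1, L + 1):  # end (exclusive) of the seed
            rest = brute_force_seeds(read, freq, x - 1, e)
            best = min(best, freq(read[s:e]) + rest)
    return best
```

With a uniform `freq` of 1 per seed, any placement of `x` seeds costs exactly `x`, which makes the toy easy to sanity-check.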
The key problem in the brute-force solution above is that it examines a lot of obviously suboptimal combinations. For example,…..
…The above observation suggests that by summarizing the optimal solutions of partial reads under a smaller number of seeds, we can prune the search space of the optimal solution. Specifically, given m (with m < x) seeds and a substring U that starts from the beginning of the read, only the optimal m seeds of U could be part of the optimal solution of the read. Any other suboptimal combinations of m seeds of U should be pruned.
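The pruning idea above translates directly into a dynamic program. A minimal sketch, assuming a hypothetical `freq()` lookup as before: `opt[m][i]` holds the minimum total frequency of `m` non-overlapping seeds placed within the prefix `read[:i]`, and each cell considers either shrinking the prefix or ending the m-th seed at position `i`.

```python
def optimal_seeds(read, freq, x):
    """DP sketch of the pruning described above. opt[m][i] is the
    minimum total frequency of m non-overlapping seeds within
    read[:i]; freq() is a hypothetical frequency lookup.
    Worst case O(x * L^2), matching the abstract's bound."""
    L = len(read)
    INF = float("inf")
    opt = [[INF] * (L + 1) for _ in range(x + 1)]
    opt[0] = [0] * (L + 1)  # zero seeds cost nothing
    for m in range(1, x + 1):
        for i in range(1, L + 1):
            # Option 1: the m-th seed ends strictly before position i.
            best = opt[m][i - 1]
            # Option 2: the m-th seed is read[j:i] for some start j.
            for j in range(i):
                cand = opt[m - 1][j] + freq(read[j:i])
                if cand < best:
                    best = cand
            opt[m][i] = best
    return opt[x][L]
```

Because only the optimal m-seed solution of each prefix is kept, all the suboptimal combinations the brute-force search would revisit are pruned away.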
Motivation: Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the potential of the mapper to select less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds.
Results: We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently occurring set of x seeds in an L-bp read in O(x×L) operations on average and in O(x×L²) operations in the worst case. We compared OSS against four state-of-the-art seed selection schemes and observed that OSS provides a 3-fold reduction of average seed frequency over the best previous seed selection optimizations.
There are often discussions about whether bioinformaticians should spend more time thinking about better algorithms or building more user-friendly implementations. The answer is neither, as a cancer benchmarking study found out. Quoting from ‘A Comprehensive Assessment of Somatic Mutation Calling in Cancer Genomes’ –
We also detected distinct clustering of different analysts. Even though analysts used similar combinations of software in their pipelines, very few similarities were detected in their calls. The combination of software is not as critical as how each piece of software is applied and what settings are applied. A slight correlation of true positive SSM calls of pipelines using MuTect and Strelka can be seen (Figure 4). Data analysis pipelines are constructed by integrating software from diverse sources. In many instances the approaches used in the different software and the assumptions made therein are not evident to the user and many parts are black boxes. In order to account for unknowns, pipeline developers resort to calibrating their pipelines against known results, for example from a genotyping experiment done on similar samples, and by using combinations of tools for the same process step, assuming that a result shared by two different approaches has a higher likelihood to be correct. Of note is that many of the pipelines apply multiple tools for the same process step in their pipeline and then use intersects to weight calls. This practice, together with good use of blacklists, results in best outputs. No best tools emerged from our benchmark and it is also clear that no strict best threshold settings exist. However, it is clear that how analysts use their pipeline is critical for the quality of their output.
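The “multiple tools for the same process step, then intersect to weight calls” practice the study describes is easy to sketch. This is a hypothetical minimal version, not any pipeline’s actual code: each callset is a set of variant tuples, a call survives only if enough callers agree, and a blacklist of known problem sites is applied afterward.

```python
from collections import Counter

def consensus_calls(callsets, min_support=2):
    """Keep a somatic call only if at least min_support pipelines
    report it. Each callset is a set of (chrom, pos, ref, alt)
    tuples; names and thresholds here are illustrative."""
    support = Counter()
    for calls in callsets:
        support.update(calls)
    return {call for call, n in support.items() if n >= min_support}

def apply_blacklist(calls, blacklist):
    """Drop calls at known problematic (chrom, pos) sites."""
    return {c for c in calls if (c[0], c[1]) not in blacklist}
```

The study’s point stands regardless of the sketch: it is this kind of calibration and cross-checking, not the choice of any single tool, that separated the best pipelines from the rest.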
Today’s interesting paper is here –
The problem of enumerating bubbles with length constraints in directed graphs arises in transcriptomics where the question is to identify all alternative splicing events present in a sample of mRNAs sequenced by RNA-seq.
We present a new algorithm for enumerating bubbles with length constraints in weighted directed graphs. This is the first polynomial delay algorithm for this problem and we show that in practice, it is faster than previous approaches.
This settles one of the main open questions from Sacomoto et al. (BMC Bioinform 13:5, 2012). Moreover, the new algorithm allows us to deal with larger instances and possibly detect longer alternative splicing events.
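For intuition about what a “bubble” is, here is a toy sketch covering only the simplest special case: a ‘diamond’ of two vertex-disjoint length-2 paths s→a→t and s→b→t, the shape a single SNP induces in a splicing or assembly graph. The paper’s algorithm handles the general weighted, length-constrained case with polynomial delay; none of that machinery is reproduced here.

```python
def diamond_bubbles(adj):
    """Enumerate the simplest bubbles (s, {a, b}, t): two
    vertex-disjoint paths s->a->t and s->b->t in a directed
    graph given as an adjacency dict. Toy illustration only."""
    bubbles = []
    for s, outs in adj.items():
        for i, a in enumerate(outs):
            for b in outs[i + 1:]:
                # targets where the two branches reconverge
                for t in set(adj.get(a, ())) & set(adj.get(b, ())):
                    if t not in (s, a, b):
                        bubbles.append((s, a, b, t))
    return bubbles
```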
Heng Li posted a new paper at arxiv on BGT format, which “stores the integer matrix as two positional BWTs (PBWTs), one for the lower bit and the other for the higher bit.”
The problem -
VCF/BCF (Danecek et al., 2011) is the primary format for storing and analyzing genotypes of multiple samples. It however has a few issues. Firstly, as a streaming format, VCF compresses all types of information together. Retrieving site annotations or the genotypes of a few samples usually requires to decode the genotypes of all samples, which is unnecessarily expensive. Secondly, VCF does not take advantage of linkage disequilibrium (LD), while using this information can dramatically improve compression ratio (Durbin, 2014). Thirdly, a VCF record is not clearly defined. Each record may consist of multiple alleles with each allele composed of multiple SNPs and INDELs. This ambiguity complicates annotations, query of alleles and integration of multiple data sets. At last, most existing VCF-based tools do not support expressive data query. We frequently need to write scripts for advanced queries, which costs both development and processing time. GQT (Layer et al., 2015) attempts to solve some of these issues. While it is very fast for selecting a subset of samples and for traversing all sites, it discards phasing, is inefficient for region query and is not compressed well. The observations of these limitations motivated us to develop BGT.
The solution -
Summary: BGT is a compact format, a fast command line tool and a simple web application for efficient and convenient query of whole-genome genotypes and frequencies across tens to hundreds of thousands of samples. On real data, it encodes the haplotypes of 32,488 samples across 39.2 million SNPs into a 7.4GB database and decodes a couple of hundred million genotypes per CPU second. The high performance enables real-time responses to complex queries.
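The bit-splitting idea quoted from the abstract is simple to illustrate. This sketch shows only the split of a row of small integer genotypes (values 0–3) into a low-bit row and a high-bit row; BGT then compresses each binary row with a positional BWT, which is not reproduced here.

```python
def split_bit_planes(row):
    """Split a row of small integers (0-3) into two binary rows:
    one holding each value's low bit, one its high bit. BGT runs
    a positional BWT over each binary row (not shown)."""
    low = [g & 1 for g in row]
    high = [(g >> 1) & 1 for g in row]
    return low, high

def join_bit_planes(low, high):
    """Recombine the two bit planes into the original integers."""
    return [(h << 1) | l for l, h in zip(low, high)]
```

Splitting into bit planes matters because the PBWT exploits linkage disequilibrium far better on binary sequences than on mixed integer codes.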
Availability and implementation: here
Most people still cannot wrap their head around what is going on in Greece and how it will affect their lives. Let me summarize that for you, but not before you get an intro on the banking system.
Banking System Fact 1.
(Physical) cash is the core of the banking system. The electronic money people see in their bank accounts is a derivative of that cash. It may or may not be converted into physical cash, when you ask for it.
Banking System Fact 2.
The total amount of physical cash in the banking system is limited. The situation is similar to gold in the 1930s. Yes, governments can print banknotes, but the banks restrict them from doing so.
Banking System Fact 3.
The worldwide banking system is going to blow up due to lack of collateral and the world is going cash only. You can see that process by following the events in Greece over the last few years. In a cash-only regime, very little electronic cash will be accepted anywhere, because a person/business accepting electronic money will be taking the risk of being locked within a bank holiday somewhere.
Based on the above three facts, can you tell me the people of which country in Europe hold the most cash right now? And after the blow-up, who will come out furthest ahead?
In the 1930s analogy, the Greeks ran away with Europe’s ‘gold’, and that was the main purpose of this drip-drip-drip bank run over the last six months, funded by the ECB. Do not underestimate Varoufakis.
[Photo via allaksogolies.gr (h/t @acting-man blog)]
In the meanwhile, enjoy a number of great cartoons on Greece here.
We are surprised that the researchers are using technology from a company that was supposed to go out of business due to competition from Oxford Nanopore. A new Nature Methods paper reports –
Assembly and diploid architecture of an individual human genome via single-molecule technologies
We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.
A condensed version of major claims in the paper is available from this post in Pacbio blog.
Those interested in latest publishing trends will find the following articles interesting –
1. The battle between Elsevier and Sci-hub
SCI-HUB TEARS DOWN ACADEMIA’S “ILLEGAL” COPYRIGHT PAYWALLS
In a lawsuit filed by Elsevier, one of the largest academic publishers, Sci-Hub.org is facing millions of dollars in damages. However, the site has no intentions of backing down and will continue its fight to keep access to scientific knowledge free and open. “I think Elsevier’s business model is itself illegal,” Sci-Hub founder Alexandra Elbakyan says.
With a net income of more than $1 billion Elsevier is one of the largest academic publishers in the world.
The company has the rights to many academic publications where scientists publish their latest breakthroughs. Most of these journals are locked behind paywalls, which makes it impossible for less fortunate researchers to access them.
Sci-Hub.org is one of the main sites that circumvents this artificial barrier. Founded by Alexandra Elbakyan, a researcher born and graduated in Kazakhstan, its main goal is to provide the less privileged with access to science and knowledge.
The service is nothing like the average pirate site. It wasn’t started to share the latest Hollywood blockbusters, but to gain access to critical knowledge that researchers require to do their work.
“When I was working on my research project, I found out that all research papers I needed for work were paywalled. I was a student in Kazakhstan at the time and our university was not subscribed to anything,” Alexandra tells TF.
Note: Sci-Hub is temporarily using the sci-hub.club domain name. The .org will be operational again next week.
2. The Future
High-energy physicists gave us the World Wide Web. Therefore, it is worth checking how they solved the publishing problem.
Citing and Reading Behaviours in High-Energy Physics. How a Community Stopped Worrying about Journals and Learned to Love Repositories
Contemporary scholarly discourse follows many alternative routes in addition to the three-century old tradition of publication in peer-reviewed journals. The field of High-Energy Physics (HEP) has explored alternative communication strategies for decades, initially via the mass mailing of paper copies of preliminary manuscripts, then via the inception of the first online repositories and digital libraries.
This field is uniquely placed to answer recurrent questions raised by the current trends in scholarly communication: is there an advantage for scientists to make their work available through repositories, often in preliminary form? Is there an advantage to publishing in Open Access journals? Do scientists still read journals or do they use digital repositories?
The analysis of citation data demonstrates that free and immediate online dissemination of preprints creates an immense citation advantage in HEP, whereas publication in Open Access journals presents no discernible advantage. In addition, the analysis of clickstreams in the leading digital library of the field shows that HEP scientists seldom read journals, preferring preprints instead.
N. N. Taleb embraced arxiv to publish his latest research with the following note –
Academic production is now up to 99% housekeeping, chickens**t, dealing with referees and perfecting commas, marketing, and only 1% substance. So trying the exact opposite with the following mode; wrote a paper, the shortest possible one on the idea. Put on FB here for 24 hours for crowdsourcing correction (Carl Fakhry found the math typos and inconsistencies in notation). Submitted the first draft to ArXiv, where it was posted with a day delay. Kept some typos in the first version (such as “reponse” instead of “response”) to signal I don’t give a f**k, that this is not “job market science”.
If the idea has merit, it could eventually circulate perhaps be even plagiarized, and this may even take years and years. If it lacks in rigor it will certainly die, as there is no formalism to hide the BS. All in all the non-substance part of the process turned out to be < 1 hour. Now to other, possibly, deeper things.
That is my best interpretation of the following story from an organization that gave us the arsenic paper (A Bacterium That Can Grow by Using Arsenic Instead of Phosphorus).
NASA Chief Scientist Ellen Stofan Predicts We’ll Find Signs Of Alien Life Within 10 Years
NASA’s top scientist predicts that we’ll find signs of alien life by 2025, with even stronger evidence for extraterrestrials in the years that follow.
“I think we’re going to have strong indications of life beyond Earth within a decade, and I think we’re going to have definitive evidence within 20 to 30 years,” NASA chief scientist Ellen Stofan said Tuesday during a panel event on water in the universe.
“We know where to look. We know how to look,” Stofan added. “In most cases we have the technology, and we’re on a path to implementing it. And so I think we’re definitely on the road.”
Others at the panel agreed.