The Most Difficult Problem in Computational Biology

The real revolution in computational biology is taking place without much public attention. Temple Smith, who is famous for the Smith-Waterman algorithm, is using computational methods to work on what Francis Crick called the most difficult problem in genetics – the evolutionary origin of the ribosome and the translational apparatus. This ‘translation’ problem got buried inside the genetic code, but Woese argued that its evolutionary aspect needs to be understood properly to fully understand the evolution of organisms (including large eukaryotic mammals).

The Evolution of the Ribosome and the Genetic Code

The evolution of the genetic code is mapped out starting with the aminoacyl tRNA-synthetases and their interaction with the operational code in the tRNA acceptor arm. Combining this operational code with a metric based on the biosynthesis of amino acids from the Citric acid, we come to the conclusion that the earliest genetic code was a Guanine Cytosine (GC) code. This has implications for the likely earliest positively charged amino acids. The progression from this pure GC code to the extant one is traced out in the evolution of the Large Ribosomal Subunit, LSU, and its proteins; in particular those associated with the Peptidyl Transfer Center (PTC) and the nascent peptide exit tunnel. This progression has implications for the earliest encoded peptides and their evolutionary progression into full complex proteins.

Readers are encouraged to check other recent papers by Smith written in collaboration with Hartman (such as GTPase and the origin of the ribosome).

Speaking of the relevance of translation to the development of mammals, readers may check the work of Maria Barna at Stanford –

RNA regulons in Hox 5′ UTRs confer ribosome specificity to gene regulation

Emerging evidence suggests that the ribosome has a regulatory function in directing how the genome is translated in time and space. However, how this regulation is encoded in the messenger RNA sequence remains largely unknown. Here we uncover unique RNA regulons embedded in homeobox (Hox) 5′ untranslated regions (UTRs) that confer ribosome-mediated control of gene expression. These structured RNA elements, resembling viral internal ribosome entry sites (IRESs), are found in subsets of Hox mRNAs. They facilitate ribosome recruitment and require the ribosomal protein RPL38 for their activity. Despite numerous layers of Hox gene regulation, these IRES elements are essential for converting Hox transcripts into proteins to pattern the mammalian body plan. This specialized mode of IRES-dependent translation is enabled by an additional regulatory element that we term the translation inhibitory element (TIE), which blocks cap-dependent translation of transcripts. Together, these data uncover a new paradigm for ribosome-mediated control of gene expression and organismal development.

Speaking of difficult problems, Smith interestingly also has his eye on the origin of cilia and eukaryotes. For a full list of other problems, readers may take a look at –

Crossroads (iii) – a New Direction for Bioinformatics with Twelve Fundamental Problems

Proper Role of Bioinformatics in Biology

Bioinformatics is a tool in biology just like a PCR machine. If you understand your tool well, you can do other work better. However, the tool should not dominate over real research.

Nobel laureate Sydney Brenner wrote in 1998 (h/t: @dangraur) –

Statements that “we have come to do biology in a new way” or “there is a new paradigm in biological research” are now commonplace. Nobody seems to be satisfied by a single good experiment that gives a precise answer to a well formulated question, which was the old way we did biology. On the contrary there is now a belief that a mass attack on parallel fronts can provide a database of all the information in one concerted effort, and all we need is a computer programme that will give everybody all the knowledge they need.

Much of this stems from genome projects, especially the effort to sequence the human genome. However, there are subtle differences between the different cultures that have generated the sequences. The yeast genome was sequenced by a co-operative venture of many small individual scientific groups, who had a deep interest in the result. Surrounding the project was an even larger group of yeast geneticists and molecular biologists who knew how to use the sequence in their experimental work. The sequence was the path to the genes of yeast; there are now ways to access all of the genes directly and the page in the Book of Life devoted to yeast is written in real DNA. The sequence has become the tool for research that it was expected to be, and not a end in itself.

It is likely that the genome projects for Caenorhabditis elegans and Drosophila will have the same impact on their fields, mainly because of the large number of researchers who can immediately make use of the product. It is with the vertebrate genomes that we find a new idea coming to the fore. Roughly speaking, the proponents have come to believe that computers can extract biological significance directly from DNA sequences.

This approach has generated two new areas of activity. One, Bioinformatics, is simply pretentious; the other, Functional Genomics, is ridiculous. The latter uses the former to try to find function from the sequences of genes. I don’t think that there are any university departments devoted to these subjects but there are certainly a growing number of companies doing one or both. Other areas are now adopting the same approach of systematically assembling data by factory methods. The proteome is emerging from two-dimensional electrophoresis of proteins, but is still a poor relation of the genome. I expect to see the glycome and the lipome next.

Actually, there is already a perfectly good name for the science of studying gene function; it used to be called Genetics. Geneticists have always been interested in function and have always used their research as a way — perhaps the way — to analyse complex functions of organisms. The sequences of genes and, better still, the pieces of DNA that correspond to the genes, replace what could only be achieved by the mutant hunt in classical experimental genetics; they are tools and not ends in themselves. We will still need to find out how each gene works and piece together the elaborate network of gene interactions by the old paradigm of experiment. In fact, sequences also offer us the possibility of interpreting Nature’s experiments in evolution, but that will come later as a consequence of knowing the genetics of contemporary organisms.

Bioinformatics has its place. Its main activity has been beneficial in that masses of data can now be easily reached and used for research. However, the idea that sequence data can have other information added to them which will give us knowledge of function is surely misplaced. For this, we must do more than repackage what is known; the computers must compute, and in order to do this we need a theory that we can test. The subject that will be developed will be one that should be called Theoretical Biology, but as this has a bad name we call it Computational Biology.

The siliconization of biology has been successful — perhaps too successful — in one area, which is in the way we communicate. I note that many researchers are now spending several hours a day with their e-mail, reading and sending messages to an increasing number of correspondents. I fear that this is going to put everybody in an electronic committee in permanent session. I have installed a very narrow pore filter on my e-mail; I have someone else read it and print out what I need to know. I started this mainly because a dentist in Philadelphia sent me voluminous messages about his new theories on the brain, and also because I cannot remember my password.

More than ten years ago, when electronic mail was still a novelty, I was given an account on a private network. Three passwords were requested to enter the system, and had to be renewed at frequent intervals for reasons of security. I used all twenty amino acids and the five nucleotide bases, and I then started on them again but written backwards, which makes a surprising list from which I particularly liked enilav, but there is also a enicuelosi, which has a good Italian ring to it. At the risk of compromising my computer security I shall disclose my favourite password which is ELCID, usually with some number attached because greedy computers want six characters. This password lets me login to the computer but apparently another one is needed for e-mail, which is a secret even from me. I am also toying with the idea of having a special address for bioinformaticists and functional geneticists to reach me. How about [email protected]???

A Glimpse into the Techno-gaga World

Cold Spring Harbor Laboratory’s Michael C. Schatz wrote “The Next 20 Years of Genome Research”, with a view of the future and a set of recommendations for newborns and the like.

Children born this summer will grow up in a drastically different world than the childhoods of the current graduating class or those from twenty years ago. The class of 2035 will have unprecedented access to information, quantitative techniques, and biotechnologies that will be used to manipulate biological systems in currently unimaginable detail. While the foundations of biology will continue to be observation, experimentation, and interpretation, the technologies and approaches used will become ever more powerful and quantitative. More so than ever, we need to revise the curriculum to integrate computational and quantitative analysis as early as possible into their training so they are ready for the world ahead [51].

My recommendation to the class of 2035 is to embrace the integration of fields that is forming modern biology. To be competitive, you will need to establish a broad interdisciplinary foundation of math and sciences as well as strong communication skills. One of the most important skills you can develop early is computer programming. While sequencing technologies and other instrumentation will come and go over the next 20 years, biology will only continue to grow its dependency on computational analysis. And much like learning to speak a new language is often easier the younger you begin, learning to “speak” to a computer seems to follow a similar path. But I also encourage you to experiment with the “wet” side of biology as early as possible as well, since this will help you to appreciate the data you work with and put you in a position to run your own experiments end to end. Indeed, the most profound advances often occur at the intersection of new biotechnology and new quantitative analysis, when you can be the first to generate a novel data type that is used to unravel a mystery of how the world operates. Finally, always remember to keep focused on the most important problems that you can hope to address.

Thankfully, our newborns did plenty of experiments on the ‘wet’ side of biology by wetting their beds.


Here is how the future will be delivered to you, according to Michael Schatz –

Over the next twenty years, though, our power for doing so will greatly improve building on the pioneering work of ENCODE [41], the Roadmap Epigenomics Project [42], and similar projects that are starting to provide detailed annotations as to the roles and evolution of sequences all throughout the genome.

Readers may note that ENCODE has been thoroughly discredited and that the ‘Roadmap Epigenomics Project’ is being led by Manolis Kellis, who was accused of fraud and scientific misconduct for his earlier work by Berkeley mathematician Lior Pachter.

An inconvenient request

One of the great things about conferences is that there is time to chat in person with distant friends and collaborators. Last July, at the ISMB conference in Berlin, I found myself doing just that during one of the coffee breaks. Suddenly, Manolis Kellis approached me and asked to talk in private. The reason for his interruption: he came to request that I remove an arXiv post of mine, namely “Comment on ‘Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions“, a response to a paper by Ward and Kellis. Why? He pointed out that my arXiv post was ranking highly on Google. This was inconvenient for him, he explained, while insisting that it was wrong of me to post a criticism of his work on a forum where he could not directly respond. He suggested that it would be best to work out any issues I might have with his paper offline. Unfortunately, there was the inconvenient truth that arXiv postings cannot be removed. Unlike some journals, where, say, a supplement can be revised while having the original removed (see the figure switching of Feizi et al.), arXiv preprints are permanent.

My initial confusion quickly turned to anger. After all, my arXiv comment had been rejected from Science where I had submitted it as a technical comment on the Ward-Kellis paper. I had then put it on the arXiv as a last resort measure to at least have some record of my concerns publicly accessible. How is this wrong? Can one not critique the work of Manolis Kellis?


Speaking of futuristic forecasts, readers may also take a look at the following one written by NIH Director Francis Collins in 1999 –

Medical and Societal Consequences of the Human Genome Project


General visions of gene-based medicine in the future are useful, but many health care providers are probably still puzzled by how it will affect the daily practice of medicine in a primary care setting. A hypothetical clinical encounter in 2010 is described here.

John, a 23-year-old college graduate, is referred to his physician because a serum cholesterol level of 255 mg per deciliter was detected in the course of a medical examination required for employment. He is in good health but has smoked one pack of cigarettes per day for six years. Aided by an interactive computer program that takes John’s family history, his physician notes that there is a strong paternal history of myocardial infarction and that John’s father died at the age of 48 years.

To obtain more precise information about his risks of contracting coronary artery disease and other illnesses in the future, John agrees to consider a battery of genetic tests that are available in 2010. After working through an interactive computer program that explains the benefits and risks of such tests, John agrees (and signs informed consent) to undergo 15 genetic tests that provide risk information for illnesses for which preventive strategies are available. He decides against an additional 10 tests involving disorders for which no clinically validated preventive interventions are yet available.

John’s subsequent counseling session with the physician and a genetic nurse specialist focuses on the conditions for which his risk differs substantially (by a factor of more than two) from that of the general population. Like most patients, John is interested in both his relative risk and his absolute risk.
John is pleased to learn that genetic testing does not always give bad news — his risks of contracting prostate cancer and Alzheimer’s disease are reduced, because he carries low-risk variants of the several genes known in 2010 to contribute to these illnesses. But John is sobered by the evidence of his increased risks of contracting coronary artery disease, colon cancer, and lung cancer. Confronted with the reality of his own genetic data, he arrives at that crucial “teachable moment” when a lifelong change in health-related behavior, focused on reducing specific risks, is possible. And there is much to offer. By 2010, the field of pharmacogenomics has blossomed, and a prophylactic drug regimen based on the knowledge of John’s personal genetic data can be precisely prescribed to reduce his cholesterol level and the risk of coronary artery disease to normal levels. His risk of colon cancer can be addressed by beginning a program of annual colonoscopy at the age of 45, which in his situation is a very cost-effective way to avoid colon cancer. His substantial risk of contracting lung cancer provides the key motivation for him to join a support group of persons at genetically high risk for serious complications of smoking, and he successfully kicks the habit.

Readers know how that story turned out. The personalized genomics company ‘23andMe’ was banned by the FDA from giving medical advice in 2013. Lior Pachter wrote “23andme genotypes are all wrong”, but later changed the title to “Multiple testing an issue for 23andme” with a note that the previous title was technically correct.

Breakdown of Trust

Berkeley mathematician Lior Pachter exposed Manolis Kellis once again in his recent post – ‘Pachter’s P-value Prize’. Previously he accused Kellis of fraud in a number of recent papers. This time the charge is the lesser one of being stupid, but it concerns an early and highly cited paper of Kellis with only two computational authors – Manolis Kellis and Eric Lander. Eric Lander is the founding director of the Broad Institute and co-chair of the U.S. President’s Council of Advisors on Science and Technology.


While we were reading the highly interesting comments on Pachter’s blog, one of them caught our attention. It came from Pavel Pevzner, who built the SPAdes assembler and the bioinformatics learning tool Rosalind with collaborators from the Algorithmic Biology Lab in Russia. Pevzner and Compeau are also the authors of ‘Bioinformatics Algorithms – An Active Learning Approach’, which is being used as the textbook in their Coursera course. Pevzner wrote –

Thank you Eric, Lior, Manolis, and everybody who provided comments on this blog. The KBL paper is now textbook material: just two weeks ago, we taught it to 1000s students in our Coursera “Bioinformatics Algorithms” class, and there will be a section on KBL in the 2nd volume of our “Bioinformatics Algorithms” textbook, to come out in July 2015.

In view of the discussion here, I and my co-instructor Phillip Compeau have decided to make a break in the Coursera course and to ask students to spend a week reading KBL and related papers, thinking about the posts on the blog, and perhaps even submitting solutions for the P-value prize challenge :-) We have excellent students: 13% of them hold Ph.D.s, and 34% hold M.S. degrees. I hope they will contribute to the discussion in a meaningful and CONCILIATORY way. Even more importantly, I hope they will learn that they should be skeptical and double check everything I tell them in class.

Thank you again for providing a great learning opportunity for our students!

What we are seeing is rather ominous. By refusing to take care of rampant fraud and over-inflated claims in top journals, the US/UK scientific establishment is creating an environment of complete breakdown of trust and an end to scientific progress. Textbooks are the primary means of summarizing scientific knowledge and passing it down from one generation to the next, and therefore textbook authors take a conservative approach in trusting material from recent discoveries. If the abstract of a highly cited and unchallenged decade-old paper in a major journal cannot be trusted, then what can be? In this context, our readers may remember that the main challenge to ENCODE’s hype came from Dan Graur, the author of a well-used textbook in molecular evolutionary biology. Larry Moran, the author of a leading biochemistry textbook, has also been quite critical of claims made by ENCODE leader Ewan Birney.

We also see quite a double standard in how scientists from different countries are treated by US ‘scientific leaders’ and their followers. When the Japanese STAP episode surfaced, US media and blogs maintained by well-known scientists howled for rapid action. All top newspapers, including the NY Times, covered the episode as a major scandal leading to the suicide of the talented embryologist Yoshiki Sasai.

Scientist Who Had Claimed Stem Cell Breakthrough Resigns From Japanese Research Institute

In a blow to the prestige of Japan’s scientific community, a government-backed research institute accepted the resignation of one of its highest-profile scientists on Friday after she failed to replicate research results that were once hailed as a breakthrough in stem cell research.

In contrast, Pachter’s previous exposure of supposed fraud by Manolis Kellis got promptly covered up by the journal. ENCODE-crook Ewan Birney got promoted at EMBL and also became a Fellow of the Royal Society despite being exposed, and we received little support or mention from anyone in the genetics community or the media for exposing Barbara Fredrickson’s PNAS paper connecting gene expression and happiness. Yoav Gilad’s criticism of mouse ENCODE got minimal response from the authors, and Stamatoyannopoulos did not bother to do even that for the Chinese paper disputing his duon claim.

The situation is far worse. Lander and his protege Manolis Kellis, who are incapable of doing basic stats, continue to draw humongous amounts of money from NIH for their ‘epigenome roadmap’ project that has been criticized by real scientists like Eric Davidson, Oliver Hobert and Mark Ptashne (check ‘The Conspiracy of Epigenome?’). They and their organization remain media darlings. Check Nature Promotes GWAS Madness to Study ‘Mental Health’ or the following NY Times article published three months back –

Project Sheds Light on What Drives Genes

“How does this conspiracy of genes work?” asked Eric Lander, director of the Broad Institute, a genetic research center affiliated with Harvard and M.I.T., who is not an author of the new papers. “This begins to connect the dots.”


To find the switches and figure out the circuits that control genes, researchers examined cells taken from healthy people and from patients suffering from a range of diseases including multiple sclerosis and diabetes. They also studied cells from different stages of life, including fetal cells and stem cells, which are present at the very earliest stage of development. Using those cells, investigators found millions of switches that control genes.

The results are published in 24 papers in Nature and other journals from Nature Publishing. The authors call it a road map to the human epigenome, a collection of chemical modifications of DNA that alter the way genetic information is used in different cells.

“We now have an unprecedented view of the living human genome,” said Manolis Kellis, a computer scientist at M.I.T. and a leader of the federally funded project.

Other media coverage of the ‘epigenome map’ junk science –

Forbes – In The Book Of Life, A Second Map Is Established

LA Times – Roadmap Epigenomics Project reads between DNA’s genetic instructions

If you think the textbooks are being irreversibly damaged by inflated claims, imagine what happens to poor kids learning biochemistry directly from Lander’s own MOOC (online course). Larry Moran wrote on his Sandwalk blog –

There’s a course at MIT (Boston, USA) called “7.00x Introduction to Biology – The Secret of Life.” It’s a very popular MOOC (online course). Here’s the trailer for the course. In it, Eric Lander tells you that if you take his introductory biology course you will learn to think like a scientist and you will be able to understand the latest breakthroughs.

Here are the week one lectures that focus on biochemistry. I don’t have time to go through it all but check out the description of ATP beginning at 2:13. This is not how good teachers explain the importance of ATP in the 21st century but it is how it was taught 40 years ago.

Here’s the week two lectures (below). Check out the part at 2:08 where Eric Lander is talking about the “Energetics of Pathways.” See if you agree with his explanation and his references to entropy. Would you explain this without talking about forward and reverse rates?

The next section on “Tricks of Pathways” is very interesting. Keep in mind that he is discussing “tricks” that make glycolysis work but most of those reactions also work in reverse to make glucose (gluconeogenesis). This gets to be a problem with trick #2 at 2:25 when he says that an “unfavorable” reaction can be “pulled” in the right direction if the next reaction in the pathway is very favorable. This explanation was popular many decades ago but now we know that almost all the reactions are at equilibrium and ΔG = 0.

My point is that just because MIT is a prestigious university and Eric Lander is a famous scientist, does not mean that this is the best undergraduate course and the best way to teach biochemistry. Some of my colleagues at colleges you’ve never heard of could do a much better job. Eric Lander could do a better job if he would just read a modern textbook.


In other news, we plan to submit the following entry to Pachter’s p-value challenge.

Bioinformatics for Married Couples

Our recent research uncovered new principles at the interface of evolution and psychology that will lead to better understanding of marriages. Therefore, we are submitting it to the most visible journal, which happens to be Lior Pachter’s blog.


Many American marriages end in divorce. A Japanese-American scientist noticed the rising trend in 1970 and explained it based on socio-evolutionary theory in his book. He suggested that man, being a solitary animal, does not like to remain paired and that is the reason for the rising trend.

His model was never tested experimentally, but thanks to the availability of large-scale analysis tools, we were able to do that in our lab. We conducted an experiment in which we placed 457 married couples (457*2 persons) in a room and picked 80 persons using an elaborate pipeline developed by us (see attached script). Strikingly, in 95% of cases (72/76), we got only one member of the couple and not the other, supporting the theory of the aforementioned Japanese-American scientist.


# put 457 married couples in a room
# each member of a couple is assigned a number
my (@room, %count);
for (my $i = 1; $i <= 457; $i++) {
    push @room, "HUSBAND$i";
    push @room, "WIFE$i";
}
my $N = scalar @room;    # 914 persons in the room

# pick 80 persons randomly
# (picks are made with replacement, so the same person can be picked twice)
for (my $i = 0; $i < 80; $i++) {
    # randomly pick a person
    my $pos = int(rand($N));

    # find the number on it
    my $selected_person = $room[$pos];
    $selected_person =~ s/HUSBAND//;
    $selected_person =~ s/WIFE//;

    # keep count
    $count{$selected_person}++;
}

# how many of those 80 will be couples?
my $same = 0;
foreach my $key (keys %count) {
    $same++ if ($count{$key} == 2);
}

print "$same\n";

The above script (our elaborate pipeline) randomly picks 80 persons from a pool of 457*2. You can see the distribution of number of pairs in the following figure that is generated by running the script 10000 times.
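As a rough cross-check of that distribution, the expected number of complete couples can be worked out directly. The sketch below is written in Python rather than the Perl of the script, and it assumes the 80 persons are drawn without replacement (the script draws with replacement, so its numbers may differ slightly). The function names `expected_couples` and `simulate_couples` are made up for this illustration.

```python
import random


def expected_couples(n_couples, n_picked):
    """Expected number of complete couples among n_picked persons drawn
    without replacement from a room of 2 * n_couples persons."""
    total = 2 * n_couples
    # Probability that both members of one given couple are among the picks.
    p_both = (n_picked * (n_picked - 1)) / (total * (total - 1))
    return n_couples * p_both


def simulate_couples(n_couples, n_picked, rng):
    """One random draw; returns how many couples were picked completely."""
    # Persons 2k and 2k+1 are treated as the two members of couple k.
    picked = rng.sample(range(2 * n_couples), n_picked)
    seen = {}
    for person in picked:
        couple = person // 2
        seen[couple] = seen.get(couple, 0) + 1
    return sum(1 for c in seen.values() if c == 2)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only to make the run repeatable
    runs = [simulate_couples(457, 80, rng) for _ in range(10000)]
    print("analytic expectation:", expected_couples(457, 80))
    print("simulated mean      :", sum(runs) / len(runs))
```

Under these assumptions only about 3.5 complete couples are expected among the 80 picks, so getting "only one member of the couple" in the vast majority of cases is what random sampling alone predicts.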


On The Origin of Phyla – Interview with Valentine

Twitter Conversations (or Lack Thereof) between ENCODE/GTEx and Yoav Gilad in One Video

The members of the large and powerful ENCODE/GTEx consortia, usually overactive in discussing their ‘high-quality science’ on social media, have strangely gone silent after the latest papers from Yoav Gilad.


An appropriate video to describe the situation –

YOAVGILAD Consortium on Results Published by ENCODE and GTEx

YOAVGILAD, a large Chicago-based consortium of human geneticists, is taking a critical look at some of the papers published by ENCODE and GTEx.

We are joking about the ‘consortium’ part. Yoav Gilad is just one researcher at the University of Chicago, and his papers have at most two to four authors. How they have the audacity to criticize a consortium of scientists, whose papers get covered by the Washington Post (of all places), is something we cannot understand.

He and his colleagues were surprised to find that certain mouse tissues had more in common with each other than with their human analogies, for example.

“So a mouse liver is a lot more similar to a mouse kidney, in terms of gene expression, than a human liver, and that was a surprise,” Snyder said. “In hindsight, this makes a lot of sense.”

One thing is for sure: a paper with ‘reanalysis’ in the title is expected to dig up a lot of dirt (for example, here is our modest effort), and Gilad’s paper is no exception. It goes after the most newsworthy part of the ENCODE paper. His other paper, on post-mortem tissues of GTEx, is linked below, but please start with Dan Graur’s brief summary on that topic.

A reanalysis of mouse ENCODE comparative gene expression data

Recently, the Mouse ENCODE Consortium reported that comparative gene expression data from human and mouse tend to cluster more by species rather than by tissue. This observation was surprising, as it contradicted much of the comparative gene regulatory data collected previously, as well as the common notion that major developmental pathways are highly conserved across a wide range of species, in particular across mammals. Here we show that the Mouse ENCODE gene expression data were collected using a flawed study design, which confounded sequencing batch (namely, the assignment of samples to sequencing flowcells and lanes) with species. When we account for the batch effect, the corrected comparative gene expression data from human and mouse tend to cluster by tissue, not by species.
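The kind of confounding described in this abstract can be checked from a sample sheet before any expression analysis is run. The following Python sketch uses a made-up sample table; the field names `species`, `tissue` and `lane` and the helper `confounded_with_batch` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def confounded_with_batch(samples, batch_key, factor_key):
    """Return True if no batch contains more than one level of the factor,
    i.e. the batch effect cannot be separated from the factor."""
    levels_per_batch = defaultdict(set)
    for sample in samples:
        levels_per_batch[sample[batch_key]].add(sample[factor_key])
    return all(len(levels) == 1 for levels in levels_per_batch.values())


# Hypothetical design in which each sequencing lane holds a single species.
confounded_design = [
    {"species": "human", "tissue": "liver", "lane": "lane1"},
    {"species": "human", "tissue": "kidney", "lane": "lane1"},
    {"species": "mouse", "tissue": "liver", "lane": "lane2"},
    {"species": "mouse", "tissue": "kidney", "lane": "lane2"},
]

# Hypothetical design in which every lane holds both species.
mixed_design = [
    {"species": "human", "tissue": "liver", "lane": "lane1"},
    {"species": "mouse", "tissue": "liver", "lane": "lane1"},
    {"species": "human", "tissue": "kidney", "lane": "lane2"},
    {"species": "mouse", "tissue": "kidney", "lane": "lane2"},
]

if __name__ == "__main__":
    print(confounded_with_batch(confounded_design, "lane", "species"))
    print(confounded_with_batch(mixed_design, "lane", "species"))
```

In the first design any difference between lanes looks exactly like a difference between species. In the second design, species can be compared within each lane.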

RNA-seq: impact of RNA degradation on transcript quantification


The use of low quality RNA samples in whole-genome gene expression profiling remains controversial. It is unclear if transcript degradation in low quality RNA samples occurs uniformly, in which case the effects of degradation can be corrected via data normalization, or whether different transcripts are degraded at different rates, potentially biasing measurements of expression levels. This concern has rendered the use of low quality RNA samples in whole-genome expression profiling problematic. Yet, low quality samples (for example, samples collected in the course of fieldwork) are at times the sole means of addressing specific questions.


We sought to quantify the impact of variation in RNA quality on estimates of gene expression levels based on RNA-seq data. To do so, we collected expression data from tissue samples that were allowed to decay for varying amounts of time prior to RNA extraction. The RNA samples we collected spanned the entire range of RNA Integrity Number (RIN) values (a metric commonly used to assess RNA quality). We observed widespread effects of RNA quality on measurements of gene expression levels, as well as a slight but significant loss of library complexity in more degraded samples.


While standard normalizations failed to account for the effects of degradation, we found that by explicitly controlling for the effects of RIN using a linear model framework we can correct for the majority of these effects. We conclude that in instances in which RIN and the effect of interest are not associated, this approach can help recover biologically meaningful signals in data from degraded RNA samples.
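To make the idea of controlling for RIN in a linear model concrete, here is a minimal Python sketch for a single gene. It is an illustration under simplifying assumptions, not the authors' pipeline: it fits expression against RIN by ordinary least squares and keeps the residuals (re-centred on the mean), whereas a real analysis would usually include RIN as a covariate alongside the effect of interest. The function name `regress_out` and the numbers are invented for the example.

```python
def regress_out(values, covariate):
    """Remove the linear effect of `covariate` from `values` using a
    one-variable least-squares fit, keeping the original mean."""
    n = len(values)
    mean_y = sum(values) / n
    mean_x = sum(covariate) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        # The covariate does not vary, so there is nothing to remove.
        return list(values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    return [y - slope * (x - mean_x) for x, y in zip(covariate, values)]


if __name__ == "__main__":
    # Synthetic example: expression of one gene rises with RNA quality.
    rin = [2.0, 4.0, 6.0, 8.0, 10.0]
    expression = [5.0 + 0.5 * r for r in rin]
    print(regress_out(expression, rin))
```

As the abstract notes, this only helps when RIN is not associated with the effect of interest. If all degraded samples belong to one condition, removing the RIN effect also removes the biological signal.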

Pachter’s Kallisto Comes with Unconventional License

We tested the latest and greatest RNAseq expression analysis program Kallisto, discussed in this link. It comes with binaries (for Mac and Ubuntu) and source code.

Running the binary

First we tested the binary code with the small data file already provided in the test folder, and it worked without trouble as expected. We are going to test on large files, but do not anticipate any problem either.

Remember this is an alignment-free method for RNAseq expression analysis. Therefore, if you diligently aligned all your reads on to the expression files, you may delete those large SAM/BAM files and proceed anew. The execution of this program will probably take less time than deleting the large SAM files from the hard drive – no kidding!

Compiling the source

Next, we tried to compile the code downloaded from GitHub. That hit a glitch (“CMake 2.8.12 or higher is required. You are running version 2.8.9”). We tried to change the CMakeLists.txt to “cmake_minimum_required(VERSION 2.8.9)”, but that gave the error – ‘Unknown CMake command “add_compile_options”’. That command was introduced in CMake 2.8.12, so we need to upgrade cmake to compile the source code.

Algorithm and code

The algorithm of the method is explained in the arxiv paper by Bray et al. It is an improvement over Sailfish, because instead of splitting the reads into k-mers, this program combines alignment and a de Bruijn graph to perform pseudoalignment of each read on an indexed reference file.
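The core idea, finding which transcripts a read is compatible with without computing a base-level alignment, can be sketched in a few lines. This is a deliberately simplified toy and not Kallisto's implementation: it uses a plain k-mer hash table instead of a de Bruijn graph, and the names `build_kmer_index` and `pseudoalign`, as well as the decision to skip k-mers that are absent from the index, are our own assumptions.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the set of transcripts compatible with the read's k-mers.

    The result is the intersection of the transcript sets of all k-mers
    found in the index. k-mers missing from the index are skipped, and
    an empty set means no compatible transcript was found.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # unknown k-mer, e.g. caused by a sequencing error
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript can explain all k-mers seen so far
    return compatible or set()
```

For example, with `transcripts = {"t1": "ACGTACGTGA", "t2": "ACGTACCTGA"}` and `k = 4`, calling `pseudoalign("GTACGT", build_kmer_index(transcripts, 4), 4)` narrows the read down to `t1` only, because the k-mer `TACG` does not occur in `t2`.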

The code (~3,500 lines of C++ code, ~5,500 lines with headers) is fairly simple to understand for those who work on bioinformatics programs. It leverages Heng Li’s super-efficient kseq.h as well as SuperFastHash, and then builds the remaining blocks according to the algorithm described above.


The license is very unusual and makes this program useless to build on. It works against sharing and rebuilding, because borrowing from their code will destroy the license of your own code-base. Be very worried !

Let me explain. Most bioinformatics programs come with a GPL or MIT/Berkeley-type license, but this one adds ‘for educational and research not-for-profit purposes’, making the license very restrictive and legally ambiguous. Let us say you add a file from it to your bioinformatics code released under GPL. Then you will be forced to add the same meaningless words to your GPL license, which will spoil the license of your GPL-licensed code. Then there is the propagation effect. If someone else borrows from your code, that person’s code will need to carry the same extra words, and so on. Similarly, you cannot use their code in a cloud application along with your data, because you cannot guarantee that the viewer of your data does not come from a company.

The entire license is reproduced below with emphasis added on the extra (and legally ambiguous) part.

Copyright ©2015. The Regents of the University of California (Regents). All Rights Reserved. Permission to use, copy, modify, and distribute this software and its documentation for educational and research not-for-profit purposes, without fee and without a signed licensing agreement, is hereby granted, provided that the above copyright notice, this paragraph and the following two paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, for commercial licensing opportunities.

Created by Nicolas Bray, Harold Pimentel, Pall Melsted and Lior Pachter, University of California, Berkeley




Following in the footsteps of Sailfish, Kallisto is another excellent program that completely changes how RNAseq expression analysis gets done. These programs give us the ability to iterate over different transcriptome assemblies and annotation sets, which is very important for researchers working on non-model organisms.

The license, however, is very restrictive and not helpful at all for those working on code. The code-base should be avoided at all costs.

Suffix Array in External Memory – Latest from Felipe Louza

Constructing suffix arrays in external memory is very useful, and we covered the topic in various blog posts (see here, here, here and here). Also, there are BEETL and RopeBWT for constructing the BWT of very large read libraries in external memory.

Felipe Louza, a Brazilian researcher working in the field and a frequent commenter in previous blog posts, has a new update that our readers will find informative.

I have committed to GitHub a new version of a tool to construct generalized
enhanced suffix arrays.


The only related work (in mode 2) I know is,
but they are limited to indexing strings of the same length.

Enjoy !
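For readers new to the terminology: a generalized suffix array indexes the suffixes of a whole collection of strings, and an "enhanced" suffix array pairs it with the longest-common-prefix (LCP) array. The following is a naive in-memory sketch of those two structures only. It is not Louza's tool, which works in external memory, and the function names and the tie-breaking rule (identical suffixes ordered by string id) are our own assumptions.

```python
def generalized_suffix_array(strings):
    """Naive generalized suffix array for a small collection of strings.

    Returns a list of (string_id, offset) pairs sorted by suffix.
    Identical suffixes are ordered by string_id. The strings may have
    different lengths.
    """
    suffixes = [
        (s[i:], sid, i)
        for sid, s in enumerate(strings)
        for i in range(len(s))
    ]
    suffixes.sort()
    return [(sid, i) for _, sid, i in suffixes]


def lcp_array(strings, sa):
    """LCP of each suffix with its predecessor in sa (first entry is 0)."""
    if not sa:
        return []

    def lcp(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    out = [0]
    for (s1, i1), (s2, i2) in zip(sa, sa[1:]):
        out.append(lcp(strings[s1][i1:], strings[s2][i2:]))
    return out
```

Materializing and sorting every suffix like this takes memory and time far beyond what is practical for read libraries, which is exactly why external-memory construction algorithms matter.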

Bioinformatics and Docker

Since it is our lightweight algorithm day, we thought it would be worthwhile to mention several relevant posts on Docker.

1. What is Docker?

Docker is a lightweight way to make code portable. It lets you containerize a code module tested on a given operating system, instead of running a separate VMware virtual machine for every new module.

The following slides introduce you to Docker. Pay special attention to the ‘matrix from hell’ slide.

[slideshare id=27669368&doc=introdockeroctober13-131028203253-phpapp02]

2. Docker in Bioinformatics

Docker is being used to containerize bioinformatics code, and there is already a biodocker project.

A few days back, redditors asked – Does Docker hit performance of bioinformatics algorithms?

The following blog post from Heng Li is informative in that respect.

A few hours with docker

Preliminary thoughts

Docker is a bless to complex systems such as the old Apache+MySQL+PHP combo, but is a curse to simple command line tools. For simple tools, it adds multiple complications (security, kernel version, Dockerfile, large package, inter-process communication, etc) with little benefit.

Bioinformatics tools are not rocket science. They are supposed to be simple. If they are not simple, we should encourage better practices rather than live with the problems and resort to docker. I am particularly against dockerizing easy-to-compile tools such as velvet and bwa or well packaged tools such as spades. Another large fraction of tools in C/C++ can be compiled to statically linked binaries or shipped with necessary dynamic libraries (see sailfish). While not ideal, these are still better solutions than docker. Docker will be needed for some tools with complex dependencies, but I predict most of such tools will be abandoned by users unless they are substantially better than other competitors, which rarely happens in practice.

PS: the only benefit of dockerizing simple tools is that we can acquire a tool with docker pull user/tool, but that is really the benefit of a centralized repository which we are lacking in our field.

Our view is in agreement with Heng Li’s last paragraph (“PS”), but Docker will likely have a different use in the bioinformatics world. More on that in a later blog post.



Adding a few other useful links –

1. Dockerized bioinformatics programs at biostar

2. Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker

We developed a freely available, easy to run implementation of bcbio-nextgen on Amazon Web Services (AWS) using Docker. bcbio is a community developed tool providing validated and scalable variant calling and RNA-seq analysis. The AWS implementation automates all of the steps of building a cluster, attaching high performance shared filesystems, and running an analysis. This makes bcbio readily available to the research community without the need to install and configure a local installation.

The entire installation bootstraps from standard Linux AMIs, enabling adjustment of the tools, genome data and installation without needing to prepare custom AMIs. The implementation uses Elasticluster to provision and configure the cluster. We automate the process with the boto Python interface to AWS and Ansible scripts. bcbio-vm isolates code and tools inside a Docker container allowing runs on any remote machine with a download of the Docker image and access to the shared filesystem. Analyses run directly from S3 buckets, with automatic streaming download of input data and upload of final processed data.

We provide timing benchmarks for running a full variant calling analysis using bcbio on AWS. The benchmark dataset was a cancer tumor/normal evaluation, from the ICGC-TCGA DREAM challenge, with 100x coverage in exome regions. We compared the results of running this dataset on 2 different networked filesystems: Lustre and NFS. We also show benchmarks for an RNA-seq dataset using inputs from the Sequencing Quality Control (SEQC) project.

3. Docker guide at basespace