General Description and First Puzzle Emailed to All Who Asked

Dear Readers,

A few days back, we asked “We Have Six Interesting Bioinformatics Puzzles, Any Takers?” and a number of you contacted us. We finally managed to write up the biological part and email you the first puzzle, along with a list of the others. Please get in touch if you have not received the email yet.

What are we trying to achieve?

The purpose of these puzzles

In recent years, many smart computer scientists have entered the field of bioinformatics, attracted by the large-data problems of NGS. Due to a lack of exposure to biology, they tend to focus on a set of mainstream questions, such as k-mer counting, genome assembly and transcriptome assembly. Biologists, on the other hand, have many interesting questions, but addressing them needs a bit of introduction to biology. Moreover, many of those subproblems suffer from the same scalability issues as genome assembly or transcriptome assembly, thanks to next-gen sequencing.

Lack of knowledge of biology is not an insurmountable difficulty, and the puzzles are designed to help computer scientists get over that barrier. Each puzzle is designed to answer a biological question that has likely gone unaddressed in the era of NGS. We took time to explain the biological context, and anticipate that once readers make some effort to understand it, they will find the bioinformatics part relatively easy to solve.

We believe the outcome of this effort will be something publishable, but even if it is not, the exposure will help readers find closely related publishable questions in neighboring areas. It is important to be exposed to as many fun problems like these as possible, so that a reader with a new computational tool can easily port it to answer a non-mainstream question.

Why are they called ‘puzzles’?

We believe each problem is fairly easy to solve after the biological part is understood. No guarantees however.

Publishing Model

With these puzzles, we are also planning to try a new publishing model. In this model, we will quickly solve the puzzles, ask some scientists to openly review the work, and then post the paper+review on arXiv or bioRxiv.

BALSA: Integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU

A new bioinformatics paper from Ruibang Luo​ and others in Tak-Wah Lam’s group.

BALSA: Integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU

This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and an intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 hours to process 50-fold whole genome sequencing (~750 million 100bp paired-end reads), or just 25 minutes for 210-fold whole exome sequencing. BALSA’s speed is rooted at its parallel algorithms to effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses.

Measuring Reproducibility in Computer Systems Research

A research group in Arizona decided to reproduce as many published claims in computer science as possible. They tracked each program on a website, recording whether they could obtain, build and test it; you can see the results at this link.


Reproducibility is a cornerstone of the scientific process: only if my colleagues can reproduce my work should they trust its veracity. Excepting special cases, in applied Computer Science reproducing published work should be as simple as going to the authors’ website, downloading their code and data, typing “make” and seeing if the results correspond to the published ones.

To investigate the extent to which Computer Science researchers are willing to share their code and data, and the extent to which this code will actually build with reasonable effort, we performed a study of 613 papers in eight ACM conferences (ASPLOS’12, CCS’12, OOPSLA’12, OSDI’12, PLDI’12, SIGMOD’12, SOSP’11, VLDB’12) and five journals (TACO’9, TISSEC’15, TOCS’30, TODS’37, TOPLAS’34).

The authors also wrote a comprehensive report on their findings. The entire report is worth reading, but we found ‘Anecdote 2’ at the end of the paper particularly interesting. It shows how far the authors were willing to go to recover materials related to the published claims. We are reproducing steps 1 (first email request), 4 (formal request for the email trail) and 5 (request for the NSF grant application). The paper has a lot more detail on the very interesting responses from the professors and students (tracked through LinkedIn!).

Step 1.

A.2.1 First emails

On October 14, 2012 we had sent this email to the authors:

I’m Christian Collberg from the University of Arizona. I was
wondering if you have a copy of {SYSTEM-R} that we could
try out? We have a new {SYSTEM} and I’d like to see how
well your system handles our {TECHNIQUES}.
Thank you!

Three of the four email addresses extracted from the author list in the paper failed; the fourth, to {PROFESSOR}, appeared to work. We received no response.
On February 19, 2013, we tried again:

Hi again!
I’m Christian Collberg from the University of Arizona. I wrote
to you earlier to see if I could have a copy of {SYSTEM-R} so
that we could try it out, but never received a reply. We have a
new {SYSTEM} and I’d like to see how well your system handles

Step 4.

A.2.4 Formal Request for Email Trail

We next sent a request for the emails between the authors, in order to trace the whereabouts of the source code of {SYSTEM-R}:

Pursuant to {OPEN RECORDS STATUTE}, the Open Records Act,
I request copies of all electronic mail between
regarding the article
published in
and the development of the
system described therein.
In accordance with {OPEN RECORDS STATUTE} I request that the
electronic databases to which the following email addresses
are attached be searched:
I also request that any private (non-university) email accounts containing work-related emails be searched.

I am willing to pay applicable fees. These records are sought in furtherance of scholarly research, and I am employed by an Educational,
Non-commercial Scientific Institution. Therefore, I ask that fees,
other than duplication fees, be waived.

and finally, step 5 !!

A.2.5 Request for the NSF grant application

We made a formal request to the NSF for the applications for the two grants that supported the research. In one, the Principal Investigator ({PROFESSOR} in the discussion above) writes:

We will also make our data and software available to the research community when appropriate.

Needless to say, we never received any communication from {PROFESSOR}, in particular no explanation as to why releasing the code might not have been “appropriate.”

Github Book on “Introduction To Applied Bioinformatics”

This seems like a fascinating resource. To browse the book, check here, or grab it from github and enjoy !! (h/t: Nick Loman)

Bioinformatics, as I see it, is the application of the tools of computer science (things like programming languages, algorithms, and databases) to address biological problems (for example, inferring the evolutionary relationship between a group of organisms based on fragments of their genomes, or understanding if or how the community of microorganisms that live in my gut changes if I modify my diet). Bioinformatics is a rapidly growing field, largely in response to the vast increase in the quantity of data that biologists now grapple with. Students from varied disciplines (e.g., biology, computer science, statistics, and biochemistry) and stages of their educational careers (undergraduate, graduate, or postdoctoral) are becoming interested in bioinformatics.

I teach bioinformatics at the undergraduate and graduate levels at Northern Arizona University. This repository contains some of the materials that I’ve developed in these courses, and represents an initial attempt to organize these materials in a standalone way. In some cases, I’m just linking out to other materials for now.

ENCODE Backers Want US to Retract Commentary, Should We?

Dear Readers,

A number of ENCODE backers are aghast and are asking us to take down a previous commentary (see tweets at the bottom). What do you suggest? We will explain our rationale for reporting, and would like to hear what you think.

We believe we are following standard journalistic procedures. Here is “Chapter 62: Privacy and public interest”, which discusses journalistic standards for the reporting of ‘private life’ by news sources.

In our opinion,

i) ENCODE is a gigantic publicly funded project and therefore its leaders are subject to public scrutiny.

ii) Moreover, ENCODE jumped across the ‘scientific argument’ barrier by manipulating the media through its press releases. Not only that, their press releases were deliberate lies (as defined by journalistic standards), if the latest PNAS paper is any guide. Therefore, they should be judged by journalistic standards, not by ‘let us keep the discussion to science’ standards. There should be a cost for using misleading press releases to gain public funds, which is what ENCODE did.

Back to the chapter on journalism -

In this chapter, we look at the relationship between a person’s right to privacy and the public’s right to know about that person’s life. We discuss what it means to be a public figure and what rights journalists have to examine their lives and the lives of their families. We conclude by examining the rights of people to grieve in private.

Readers are encouraged to go through the chapter’s detailed discussion of what a journalist should and should not report about the private lives of public figures. Let us mention the summary and a few relevant paragraphs.


You have a right to report on the public life of public figures

You can report on the private life of public figures if:
– it tells something about their character which might affect their public duty
– they are responsible for public assets
– their private misdeeds could affect the public good

You have no right to intrude on a person’s private life where there is no public benefit

From another section -

Public figures

How far can you probe into a person’s private life to get news? This is most easily answered where the individuals are public figures, especially where they are people who have put themselves forward for public positions of trust. We are talking here particularly about people like politicians, group leaders, clergymen and all those people whose personalities and private morality are essential parts of their work.

You must make a distinction between those people who have voluntarily entered the public arena and those who are forced into it by circumstances they could not reasonably have expected. For example, a businessman who holds a press conference to announce some new money-making project is seeking public attention; the airline hostess who suddenly discovers she has contracted a rare tropical disease has simply been thrust into the news against her will.

You could justify probing into both the public and private finances of the businessman. You cannot justify digging up scandalous details of the flight attendant’s private life where it does not have any relevance to the story of the disease.

There is also the question of who is a public figure. Most journalists would accept that it is their duty to examine the whole life of someone like the President of the United States in detail because he put himself forward to be President. His press secretary acts as the President’s mouthpiece on many public issues and is expected to reflect the President’s thinking. Is the press secretary a public figure? Would journalists be justified in publishing stories about his affair with an office cleaner?

The answer to the first question is that maybe he is a public figure. The answer to the second question is probably “No”, we should not write about his affair with the office cleaner – unless he was giving the cleaner government secrets in bed, and she was passing them on to an enemy. Or if there was a chance that he could be blackmailed into betraying his public trust because of the affair.

From elsewhere in the same document -

Private morality can tell us something about the person’s character, and how it could affect their professional performance. If, in his private life, a public figure is found to have lied in a serious way, the public should be made aware that he could be lying in his work, too. Where public figures are responsible for setting a moral tone in society, any private immorality should be exposed as hypocrisy. For example, society should be aware that a leading campaigner against child abuse regularly beats his own children.

and the most important part -

The media should constantly examine the lives of public figures with responsibility for public funds and other assets. Politicians who have the power to influence the awarding of contracts should accept that their private friendships with business people should be open to public view. After all, it is taxpayers’ money they could be giving away illegally. Politicians can promise voters that their friendships will never influence them in public office. As a journalist, you should monitor whether they keep that promise.

Does the public have a right to know about the private life of a scientist who is on several NHGRI advisory panels, who is being accused of fraud by a renowned Berkeley professor AND who used the media to send out misleading information about his research?

Should scientists expect journalists to apply only those rules which work in their favor, irrespective of the scientists’ own actions?

If these scientists are so morally upright, why are they not applying their scientific standards to ask ENCODE to retract its Nature paper, or to ask Kellis to retract his Nature Biotech paper being discussed by Bray/Pachter?

Is It Time to Get the ENCODE Paper Retracted?


Based on everyone’s request, we removed the commentary from our blog. It was distracting us from other scientific topics we like to cover. Everything we posted was from the public domain, and we do not have to reproduce it here to make a point. We hope these people go after the 80% functionality claim in the media with the same zeal.

A Very Good Discussion on ‘Missing Heritability Problem’

Evan Charney from Duke Institute for Brain Science, Duke University wrote an informative commentary titled -

Still Chasing Ghosts: A New Genetic Methodology Will Not Find the “Missing Heritability”

It touches on many issues we discuss here regarding the use of massive volumes of data to find genetic clues to complex diseases, and talks about yet another ENCODEsque junk-science project called GCTA, or genome-wide complex trait analysis.

One of the hopes and promises of the Human Genome Sequencing Project was that it would revolutionize the understanding, diagnosis, and treatment of most human disorders.


And to date, not a single polymorphism has been reliably associated with any psychiatric disorders nor any aspect of human behavior within the “normal” range (e.g., differences in “intelligence”).

To some researchers this state of affairs has given rise to a conundrum known as the “problem of missing heritability.” If traits such as intelligence are reported to be 50% heritable, goes the theory, why have no genes associated with intelligence been identified?

Why does it not work? Well, we have the same culprits – (i) false positives, (ii) false model.

On false positives, here is the short summary, along with a paragraph giving the detailed description -

GCTA studies are highly vulnerable to confounding by population stratification

Genetic studies (by whatever method) that have so far purported to identify SNPs associated with one or another trait have more often than not been false positives [18-20]. A prime cause of this has been the failure of researchers to take adequately into account population stratification.

The problem of ‘false model’ is even more critical.

Further problems of GCTA

While I have focused on population stratification, there are at least two other things to note about GCTA studies. First, GCTA assumes “additive genetic variance,” i.e., that each polymorphism contributes a tiny amount to heritability and that the “effects” of all the polymorphisms can simply be added together. This ignores widespread evidence that genes influence the effects of other genes in highly complex, non-additive ways (“G x G” interactions), and that the environment influences the manner in which genes are transcribed in equally complex ways (“G x E” interactions). Second, all GCTA estimates are derived from looking only at SNPs, but SNPs are only one form of genetic polymorphism. There are numerous other kinds of prevalent genetic variations, including copy number variations, multiple copies of segments of genes, whole genes, and even whole chromosomes. There is no rational scientific reason to assume that SNPs are the only relevant, or even the “most important” form of genetic variation (other than the fact that SNP data is easiest to obtain).

The article received many good comments from others appalled by these wasteful stamp-collection projects. Ken Weiss, Penn State professor and long-term critic of junk sciences like ENCODE, GWAS and GCTA, wrote -

I agree with what is said here generally, but I don’t think it will make the missing heritability (Mh) ‘problem’ go away. There are too many people who for many different reasons believe that more sophisticated or greater sampling, or more extensive sequencing and analysis, and larger studies, paired with animal models, will eventually account for Mh. Whether this is a correct belief or as much a rationale for funding continual increases in study scale, is debatable.

Rare variants reflect one ‘out’ that is often invoked, and they certainly require large studies of one sort or another. The question here is whether enumerating rare variants and demonstrating their causal role (if it can actually be done) will do much, especially since most rare variants will be like their more common known ones, and have very small individual effects.

Another strategy is to blame the mH on interactions. Huge studies or very clever designs may identify such interactions and evaluate their import, perhaps at least generically if not by enumeration.

So the problem will, I predict, persist. That doesn’t mean the claims about how to find mH are justified.

However, the comment by M. C. Jones summarizes the state of affairs in two sentences.

Failed paradigms have a way of slouching on from beyond the grave after they’ve been declared dead, especially if they have become lucrative, prestigious, and elaborate industries supporting many livelihoods and paying the mortgages on many yachts.

We recommend that readers go through both the article and the comments. You will find a lot to think about in the back-and-forth discussions.

Debunking ENCODE’s Latest Propaganda Piece Para by Para

ENCODE leaders published their latest propaganda piece in PNAS -

“Defining functional DNA elements in the human genome”

and Dan Graur has done a fantastic job of tearing it apart.

@ENCODE_NIH in PNAS 2014: In 2012, the Dog Ate Our Lab Notebook and We Had No Laxative to Retrieve It

Instead of rewriting his blog post, let us comment on random samples from here and there in the article.

Marketing Your Science, LLC

We never had the honor of writing papers with co-authors from ‘scientific’ organizations like the above. WTF – ‘science’ (i.e. an effort to find truth) needs marketing these days to get funded by the government? Are those marketing efforts also paid for by ENCODE/NHGRI?

That leads to the question of how much of the PNAS paper is science and how much is ‘advertising’ (i.e. half-truths and systematic efforts to hide the negatives).

Proposed Future plan

In the last paragraph, ENCODE tells us what their next ‘big science’ scheme is going to be.

The data identify very large numbers of sequence elements of differing sizes and signal strengths. Emerging genome-editing methods (113, 114) should considerably increase the throughput and resolution with which these candidate elements can be evaluated by genetic criteria.

That is downright scary !! The ENCODE buffoons ‘proved’ 80% of the human genome to be biochemically functional (whatever that means), which also implies that they will be allowed to waste the largest amount of money doing CRISPR-cas9 editing of nearly the entire human genome.

Speaking of CRISPR, we should also keep in mind that the discovery came from a French food company and not through NIH funds. If you read about the discovery in the above link, you will realize that such innovative research projects are in fact at a great disadvantage at NIH, thanks to ENCODE-like wasteful and human-centric big-science projects. Who cares about the fight between bacteria and phage, which has little ‘human connection’?


On Functional Elements

ENCODE clowns wrote -

Despite the pressing need to identify and characterize all functional elements in the human genome, it is important to recognize that there is no universal definition of what constitutes function, nor is there agreement on what sets the boundaries of an element.

Readers should note that the ‘pressing need’ claim came from the ENCODE ‘leaders’ and nobody else. They were the ones who invented this fictitious make-work project. Let us go through a bit of history. In 2003, our group at NASA (PI – Victor Stolc) and a Yale group involving two would-be ENCODE leaders published a tiling-array scan of the entire human genome in Science.

Global identification of human transcribed sequences with genome tiling arrays

It was one of our most expensive projects at that time (~$100K of arrays). In fact, for much, much less money, we carried out a number of other innovative studies in a range of model organisms. Our very cool chlamy paper cost only $4K of arrays and identified a set of potential regulators in the human eye. Our yeast paper cost around the same ($6K), sea urchin a bit more ($29K), and so on, and each experiment was a breakthrough compared to the status quo in those organisms. We often wondered whether the human effort was worth the money in comparison, given how much more we learned from experiments in other organisms. Little did we realize that Snyder/Gerstein would get to flip it to the government to waste $300 million for almost no science !

ENCODE’s own website presents the following purpose for their project -

The National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, in September 2003, to carry out a project to identify all functional elements in the human genome sequence.

That is from the first paragraph of the overview on the website, which is the first section of the entire website. After telling everyone for a decade that ENCODE’s goal was to find ‘functional elements’, the buffoons now inform us that it is impossible to define either ‘functional’ or ‘element’. That IS simply priceless !!!

Some of these differences stem from the fact that function in biochemical and genetic contexts is highly particular to cell type and condition.

Oh geez, did you just figure that out? And your solution was to take as many cell types as possible and add up all the effects, so as to claim as much of the human genome as possible to be functional and write a eulogy for junk DNA, right?

Moreover, each approach remains incomplete, requiring continued method development (both experimental and analytical) and increasingly large datasets (additional species, assays, cell types, variants, and phenotypes).

Well, one big thing that is complete is the number 100. Once you reach 100% of the genome, we hope it will not be necessary to look further for functional elements.


Low, Medium and High Evidence

A 2012 press release from Zhiping Weng, one of the ENCODE leaders, said -

Using data generated from 1,649 experiments—with prominent contributions from the labs of Job Dekker, PhD, professor of biochemistry & molecular pharmacology and molecular medicine, and Zhiping Weng, PhD, professor of biochemistry & molecular pharmacology—the group has identified biochemical functions for an astounding 80 percent of the human genome. These findings promise to fundamentally change our understanding of how the tens of thousands of genes and hundreds of thousands of gene regulatory elements, or switches, contained in the human genome interact in an overlapping regulatory network to determine human biology and disease.

Note the words – ‘identified biochemical functions for an astounding 80 percent of the human genome’ and ‘fundamentally change our understanding’.

Here is the main figure of the new PNAS paper -


It suggests that the previously marketed astounding discovery was backed by low-evidence data. Not only that, the caption of the new figure has the audacity to say -

This summary of our understanding in early 2014 will likely evolve substantially with more data….

So, we now learn that, $300 million later, ENCODE did not manage to generate enough ‘data’ to back a point which they previously claimed to be true in numerous press releases. It is as if NASA went around telling everyone that they heard voices from humans living on Uranus, and then revealed on further prodding that the claim was based entirely on low-quality information.

Such a thing can only happen in the ENCODE world !


GATA1 – Small Science versus Big Science

The paper discusses one gene, GATA1, in quite a bit of detail. Why GATA1? A bit of googling tells us that it is the gene Ross Hardison, the corresponding author, spent years working on and understands well. That is a product of ‘small science’, which is supposedly discredited by ENCODE-type large-scale efforts.

In fact, resorting to a familiar gene disproves everything ENCODE stands for. If big science worked, the ENCODE team could have easily pulled an arbitrary segment of the genome they knew nothing about and given an example of how easy it was to work out the biology of that region as thoroughly as for GATA1.

The reason we bring this up is that we stared at our own tiling-array data long enough to know how beautiful the known regions looked. Troubles started only when we tried to use that familiarity to make large-scale inferences about the entire genome.


On Evolution

ENCODE leaders have a very narrow view of evolution, which relies only on sequence conservation of a restrictive form. In their paper, they claim that the evolutionary method (i.e. sequence conservation) is not enough to identify all ‘functional elements’, and that ENCODE-type waste is therefore necessary. However, that is complete nonsense. Evolutionary research works at many scales, and here is a good example, again from CRISPR. After the French group identified the function of CRISPR repeats in one bacterium, other researchers observed similar sequence patterns in the genomes of a large number of bacteria and archaea. Until that first discovery was made, however, most researchers analyzing those genomes had dismissed the same CRISPR blocks as useless.

The key point is that functionality can be inferred through evolutionary argument only by doing real biological experiments in many contexts/organisms, not by running as many large-scale experiments in human cells as possible. Now think about how many such real biological experiments could not be done because of ENCODE. Even if ENCODE’s $300 million had been split into 300 $1M grants on random organisms and half of the groups had found nothing, we would have learned a lot more about the functionality of human genes through evolutionary inference than we learned from ENCODE.


Case for Abundant Junk DNA

The large section titled ‘case for abundant junk dna’ makes no evolutionary sense. Over the last three years, we have been working on a fish that is reasonably close to Danio (zebrafish) in evolutionary terms. The genome of this fish is only half the size of the zebrafish genome, even though the fish looks the same, grows the same and swims in the same way.
Yet this is what we see in our genomic data when we make a gene-by-gene comparison (there is considerable synteny): the exons are of nearly the same size in both fish species, whereas the introns are about half the size of the Danio introns, for a large number of genes and a large number of intronic regions. We have made this comparison genome-wide, and we have manually compared a number of genes for various reasons. We always found the introns to shrink by about 50%.
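As a toy illustration of this kind of gene-by-gene comparison (the gene names and intron lengths below are invented for the example, not our actual data):

```python
# Hypothetical intron lengths (bp) for two syntenic genes in the two fishes.
danio_introns = {"geneA": [1200, 800, 2400], "geneB": [600, 900]}
small_fish_introns = {"geneA": [600, 350, 1300], "geneB": [280, 500]}

# Per-gene ratio of total intron length (small fish / Danio); a genome-wide
# version would loop over all syntenic gene pairs the same way.
ratios = {g: sum(small_fish_introns[g]) / sum(danio_introns[g])
          for g in danio_introns}
for gene, r in sorted(ratios.items()):
    print(f"{gene}: introns shrink to {r:.0%} of Danio size")
```

With these made-up numbers, both genes come out near the 50% shrinkage described above.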

How can that happen if 80% of the Danio genome were functional? If anything, the fish we are working with should have more functional regions, because it has three big organs Danio does not have. With that understanding, check the following sentence from their paper -

ENCODE maps of polyadenylated and total RNA cover in total more than 75% of the genome. These already large fractions may be underestimates, as only a subset of cell states have been assayed.

Geez, apparently if you give these clowns more money, they can easily find evidence for 150% of the human genome being active. So, possibly the only solution is to give them 1/10th of the money and get that 75% down to a healthy 7.5% :)

Jokes aside, there is solid mathematics behind the above statement. Check -

Big Data – Where Increasing Sample Size Adds More Errors

Conventional wisdom says a larger sample size will make experiments more accurate, but that does not hold in many situations. More samples can result in more errors, as the following example explains.

Ask 1 million monkeys (~2^20) to predict the direction of the stock market for the year. At the end of the year, about 500K monkeys will have been right and about 500K wrong. Remove the second group and redo the experiment the next year. At the end of that year, you will be left with about 250K monkeys who correctly called the market two years in a row. Keep running the same experiment for 20 years, and you will be left with one monkey who predicted the stock market correctly for 20 years in a row. Wow !
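The monkey experiment is easy to simulate. A minimal sketch (the function name and seed are ours; each yearly call is modeled as a fair coin flip):

```python
import random

def surviving_monkeys(n_monkeys, n_years, seed=42):
    """Keep only the monkeys whose up/down market call was right every
    year so far; each call is an independent 50/50 guess."""
    rng = random.Random(seed)
    survivors = n_monkeys
    for _ in range(n_years):
        # Roughly half the remaining monkeys guess right each year.
        survivors = sum(1 for _ in range(survivors) if rng.random() < 0.5)
    return survivors

# Starting from ~2^20 monkeys, after 20 years only a handful (on average,
# one) of "perfect" 20-year forecasters remain, purely by chance.
print(surviving_monkeys(2**20, 20))
```

The surviving count is Poisson-distributed with mean 1, so some runs leave zero perfect monkeys and some leave two or three; the point is that a "perfect record" emerges from pure noise.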

Or according to Taleb -

The winner-take-all effects in information space corresponds to more noise, less signal. In other words, the spurious dominates.

Information is convex to noise. The paradox is that increase in sample size magnifies the role of noise (or luck); it makes tail values even more extreme. There are some problems associated with big data and the increase of variables available for epidemiological and other “empirical” research.

Back to the ENCODE paper -

However, for multiple reasons discussed below, it remains unclear what proportion of these biochemically annotated regions serve specific functions.

Those ‘multiple reasons’ leave room for anything from 5 percent to 5 gazillion percent, to the point where throwing a dart is possibly a more exciting way to arrive at the answer. We expected the pilot ENCODE and ENCODE to answer those questions, even if only for a part of the genome, not to sell us an expensive dartboard.


Lack of Null Hypothesis in Defining ‘Functional Element’

A few months back we covered an experiment by Mike White, which showed that “Random DNA Sequence Mimics #ENCODE !!”.

Last September, there was a wee bit of a media frenzy over the Phase 2 ENCODE publications. The big story was supposed to be that ‘junk DNA is debunked’ – ENCODE had allegedly shown that instead of being filled with genetic garbage, our genomes are stuffed to the rafters with functional DNA. In the backlash against this storyline, many of us pointed out that the problem with this claim is that it conflates biochemical and organismal definitions of function: ENCODE measured biochemical activities across the human genome, but those biochemical activities are not by themselves strong proof that any particular piece of DNA actually does something useful for us.

The claim that ENCODE results disprove junk DNA is wrong because, as I argued back in the fall, something crucial is missing: a null hypothesis. Without a null hypothesis, how do you know whether to be surprised that ENCODE found biochemical activities over most of the genome? What do you really expect non-functional DNA to look like?

In our paper in this week’s PNAS, we take a stab at answering this question with one of the largest sets of randomly generated DNA sequences ever included in an experimental test of function. We tested 1,300 randomly generated DNAs (more than 100 kb total) for regulatory activity. It turns out that most of those random DNA sequences are active. Conclusion: distinguishing function from non-function is very difficult.

Strangely, the new ENCODE paper completely overlooked that important point. While the ENCODE propaganda piece says, on ‘What Fraction of the Human Genome Is Functional?’ -

Limitations of the genetic, evolutionary, and biochemical approaches conspire to make this seemingly simple question difficult to answer.

their lack of a null hypothesis should be treated as an even bigger ‘conspiracy’ !


Using Discredited GWAS Studies as Support for More Functional Blocks

ENCODE paper says -

Results of genome-wide association studies might also be interpreted as support for more pervasive genome function. At present, significantly associated loci explain only a small fraction of the estimated trait heritability, suggesting that a vast number of additional loci with smaller effects remain to be discovered. Furthermore, quantitative trait locus (QTL) studies have revealed thousands of genetic variants that influence gene expression and regulatory activity (94–98). These observations raise the possibility that functional sequences encompass a larger proportion of the human genome than previously thought.

Give us a break, dudes !! ‘Thousands’ within a genome consisting of 3 billion nucleotides means nothing at all.

Moreover, many of those same GWAS-type ‘discoveries’ are getting ENCODE-like treatment from respected scientists. For example, check what Bernard Strauss said about abundant false positives among newly discovered cancer mutations.

Mutation and Cancer – A View from Retirement

What does seem clear is that the cancer genome project and the cancer atlas are examples of the inefficiency that is the consequence of funding large projects without accompanying large ideas. To be fair, given the impetus of the new technology it was probably impossible not to set these machines on to the available tumors in the expectation of finding druggable targets. However, the suggestion of the Lawrence paper that “the ultimate solution will probably involve . . . massive amounts of whole genome sequencing” amounts to a dogged adherence to a failed strategy – similar to the massive attacks on the trenches by the Generals of World War I.
The conclusion of the paper is that most of the mutations recognized as “drivers” are really false positives, the result of not having properly calculated control mutation rates.

Valuable Public Resource Excuse

The major contribution of ENCODE to date has been high-resolution, highly-reproducible maps of DNA segments with biochemical signatures associated with diverse molecular functions. We believe that this public resource is far more important than any interim estimate of the fraction of the human genome that is functional.

Q1. Who decided that those ‘public resources’ were needed?
Q2. Given that they were merely ‘resources’, did we get them at the best price achieved through competitive bidding?


Asymmetry of Press Releases

From another part of the paper -

Although ENCODE has expended considerable effort to ensure the reproducibility of detecting biochemical activity (99), it is not at all simple to establish what fraction of the biochemically annotated genome should be regarded as functional.

The entire paragraph and the subsequent one explain in great detail how difficult it is to call a transcript. In 2014, ENCODE leaders are telling us what those of us working on tiling arrays had known for a long time, and we took great care in our 2003 paper not to over-inflate the results. Even in those days, we noticed that the analyses by the Affymetrix group (led by Gingeras) tended toward inflated claims, despite their lower-quality probes. What took them so long to learn?

Apart from that, we are puzzled that this PNAS paper, which makes so many important and valuable points, did not come with press releases and youtube videos like the ones from 2012.


‘Conflict of Interest’ Statement

The paper boldly states “The authors declare no conflict of interest”. We wish the authors had simply omitted that statement, because it gives the impression that the authors are scientists giving unbiased and objective opinions. How about this as an example of ‘conflict of interest’ (from an ENCODE leader’s 2012 ENCODE press release)?

For the next phase of the ENCODE project, Weng received a four-year, $8 million grant from the NIH to lead the Data Analysis Center of the project.

It is clear to anyone who can think that ENCODE leaders publish press releases claiming astounding discoveries and then get rewarded by Eric Green (an author of the paper) with more of other people’s money to waste.
Maybe it is time for scientists with an interest in getting related large grants to mention any work on the project as a ‘conflict of interest’ !



Why this complete mockery of science called ENCODE does not get immediately shut down, when millions of real scientists are closing their labs, is far beyond our ability to fathom.

How to Enhance Career by Publishing in Open-access Journal

We previously asked what arguments to use to convince co-authors to post their preprints on arxiv/biorxiv.

Arxiv and BioRxiv Fans – Please Help

A new blog post by impactstory covers the next step of the process, which is to convince everyone to publish in “megajournals” – open-access online journals like PLOS ONE, BMJ Open, etc.

The Perils of Megajournals–and How to Avoid Them

In an ideal world, where a paper is published should not matter, because you are judged by your intellectual contributions, right? Sadly, many superficial academics still decide whether a paper is good or bad based on the journal, citation count, # of retweets, the researchers’ university and so on. If you are in that kind of business, the linked article addresses the following points to help you.

1. My co-authors won’t want to publish in megajournals

2. No one in my field will find out about it

3. My CV will look like I couldn’t publish in “good” journals

Enjoy !!

(and always remember, a highly cited flawed paper is still a flawed paper).


Tragedy of the Day: PNAS Got Duped by Positivity Lady !!

Nick Brown on Disrupting Science of Happiness Lady

Mapping to a Graph-style Reference Genome – Arxiv Paper

David Haussler and colleagues have been working on how to align a new read library against thousands of human genomes. The easy solution of performing the same alignment a thousand times does not scale well. Benedict Paten, Adam Novak and David Haussler posted a new paper in arxiv to address this question. It appears to be an interesting paper, but we have not read it carefully enough to understand the innovation beyond what Jouni Sirén (corrected) and others proposed before.

To support comparative genomics, population genetics, and medical genetics, we propose that a reference genome should come with a scheme for mapping each base in any DNA string to a position in that reference genome. We refer to a collection of one or more reference genomes and a scheme for mapping to their positions as a reference structure. Here we describe the desirable properties of reference structures and give examples. To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.
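To make the abstract concrete, here is a toy illustration of the idea (our own sketch, not the data structures from the Paten/Novak/Haussler paper): a reference “graph” whose nodes carry DNA, and a naive scheme mapping each base of a query string to a (node, offset) position in that graph.

```python
# Toy reference structure: nodes carry sequence, edges give the topology.
# A variant site is represented by two alternative nodes (2 and 3).
nodes = {
    1: "ACGT",   # shared prefix
    2: "A",      # allele A at the variant site
    3: "G",      # allele G at the variant site
    4: "TTCA",   # shared suffix
}
edges = {1: [2, 3], 2: [4], 3: [4]}  # adjacency, shown for completeness

def map_string(query, path):
    """Map each base of `query` to a (node, offset) position along a
    node path, assuming the query spells out the path's sequence."""
    positions = []
    i = 0
    for node in path:
        for offset, base in enumerate(nodes[node]):
            if i < len(query) and query[i] == base:
                positions.append((node, offset))
                i += 1
    return positions

# A read carrying the 'G' allele maps through nodes 1 -> 3 -> 4,
# so every one of its bases gets a position in the graph reference.
print(map_string("ACGTGTTCA", [1, 3, 4]))
```

The point of the exercise: with a single linear reference, a read carrying the alternate allele would show a mismatch, whereas the graph gives every base of either allele a first-class position. The real problem, of course, is finding the right path efficiently among thousands of genomes, which is what the paper (and Sirén's earlier indexing work) is about.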

Readers may also find the following work from Haussler group relevant -

HAL: a Hierarchical Format for Storing and Analyzing Multiple Genome Alignments


Speaking of Jouni Sirén’s algorithm, the following twitter discussions are relevant -


Dangerous Central Planners Coming to ‘Rescue’ US Biomedical Research

Four fools enamored with central planning (Bruce Alberts, Marc W. Kirschner, Shirley Tilghman, and Harold Varmus) wrote an article in PNAS titled -

Rescuing US biomedical research from its systemic flaws