

Twitter Conversations (or Lack Thereof) between ENCODE/GTEx and Yoav Gilad in One Video

The members of the large and powerful ENCODE/GTEx consortia, usually overactive in promoting their ‘high-quality science’ on social media, have strangely gone silent following the latest papers from Yoav Gilad.


An appropriate video to describe the situation –

YOAVGILAD Consortium on Results Published by ENCODE and GTEx

YOAVGILAD, a large Chicago-based consortium of human geneticists, is taking a critical look at some of the papers published by ENCODE and GTEx.

We are joking about the ‘consortium’, of course. Yoav Gilad is just one researcher at the University of Chicago, and his papers have at most two to four authors. How he has the audacity to criticize consortia of scientists, whose papers get covered by the Washington Post (of all places), is something we cannot understand.

He and his colleagues were surprised to find that certain mouse tissues had more in common with each other than with their human analogues, for example.

“So a mouse liver is a lot more similar to a mouse kidney, in terms of gene expression, than a human liver, and that was a surprise,” Snyder said. “In hindsight, this makes a lot of sense.”

One thing is for sure: a paper with ‘reanalysis’ in the title is expected to dig up a lot of dirt (for example, here is our modest effort), and Gilad’s paper is no exception. It goes after the most newsworthy claim of the mouse ENCODE paper. His other paper, on the post-mortem tissues used by GTEx, is linked below, but please start with Dan Graur’s brief summary of that topic.

A reanalysis of mouse ENCODE comparative gene expression data

Recently, the Mouse ENCODE Consortium reported that comparative gene expression data from human and mouse tend to cluster more by species rather than by tissue. This observation was surprising, as it contradicted much of the comparative gene regulatory data collected previously, as well as the common notion that major developmental pathways are highly conserved across a wide range of species, in particular across mammals. Here we show that the Mouse ENCODE gene expression data were collected using a flawed study design, which confounded sequencing batch (namely, the assignment of samples to sequencing flowcells and lanes) with species. When we account for the batch effect, the corrected comparative gene expression data from human and mouse tend to cluster by tissue, not by species.
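
To see why the confounded design matters, here is a toy sketch in Python with made-up numbers (this is not the pipeline used in the reanalysis): when flowcell assignment tracks species, a large technical offset dominates the sample distances, and removing it changes what the samples cluster by.

```python
# Toy illustration of batch confounding; all numbers are synthetic and this is
# NOT the reanalysis pipeline, only a sketch of the design problem.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
samples = ["human_liver", "human_kidney", "mouse_liver", "mouse_kidney"]
batch = np.array([0, 0, 1, 1])                        # sequencing batch == species (confounded)
tissue = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])   # liver / kidney indicator

n_genes = 200
tissue_signal = tissue @ rng.normal(0, 1, size=(2, n_genes))   # conserved tissue signature
batch_offset = rng.normal(0, 3, size=(2, n_genes))[batch]      # strong technical (flowcell) effect
expr = tissue_signal + batch_offset + rng.normal(0, 0.5, size=(4, n_genes))

def leaf_order(mat, label):
    tree = linkage(pdist(mat), method="average")
    print(label, dendrogram(tree, labels=samples, no_plot=True)["ivl"])

leaf_order(expr, "uncorrected:")          # groups by batch, i.e. by species

corrected = expr.copy()
for b in np.unique(batch):                # center each gene within each batch
    corrected[batch == b] -= corrected[batch == b].mean(axis=0)

leaf_order(corrected, "batch-centered:")  # groups by tissue
# Caveat: because batch and species are fully confounded here, centering also
# removes any genuine species signal -- which is exactly why the design is flawed.
```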

RNA-seq: impact of RNA degradation on transcript quantification

Background

The use of low quality RNA samples in whole-genome gene expression profiling remains controversial. It is unclear if transcript degradation in low quality RNA samples occurs uniformly, in which case the effects of degradation can be corrected via data normalization, or whether different transcripts are degraded at different rates, potentially biasing measurements of expression levels. This concern has rendered the use of low quality RNA samples in whole-genome expression profiling problematic. Yet, low quality samples (for example, samples collected in the course of fieldwork) are at times the sole means of addressing specific questions.

Results

We sought to quantify the impact of variation in RNA quality on estimates of gene expression levels based on RNA-seq data. To do so, we collected expression data from tissue samples that were allowed to decay for varying amounts of time prior to RNA extraction. The RNA samples we collected spanned the entire range of RNA Integrity Number (RIN) values (a metric commonly used to assess RNA quality). We observed widespread effects of RNA quality on measurements of gene expression levels, as well as a slight but significant loss of library complexity in more degraded samples.

Conclusions

While standard normalizations failed to account for the effects of degradation, we found that by explicitly controlling for the effects of RIN using a linear model framework we can correct for the majority of these effects. We conclude that in instances in which RIN and the effect of interest are not associated, this approach can help recover biologically meaningful signals in data from degraded RNA samples.
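
For readers curious what ‘explicitly controlling for RIN in a linear model’ looks like in practice, here is a minimal sketch with hypothetical numbers for a single gene. The paper fits a richer framework across all genes; this only conveys the idea.

```python
# Toy sketch: adjust log-expression for RNA quality (RIN) with a linear model.
# Sample labels, RIN values and expression values are hypothetical; the paper's
# framework is richer (per-gene effects across many genes); this only shows the idea.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "log_expr": [5.1, 5.3, 4.2, 4.0, 6.0, 6.2, 5.0, 4.8],   # one gene, eight samples
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],   # effect of interest
    "rin":      [9.0, 8.5, 4.0, 3.5, 9.2, 8.8, 4.5, 4.1],   # RNA Integrity Number per sample
})

# Fit expression ~ group + RIN; the group coefficient is estimated while
# explicitly controlling for degradation (RIN).
fit = smf.ols("log_expr ~ group + rin", data=df).fit()
print(fit.params)                      # Intercept, group[T.B], rin
print(fit.pvalues["group[T.B]"])

# This only works when RIN is not associated with the effect of interest:
# if all "B" samples were degraded and all "A" samples were intact,
# the two effects could not be separated.
```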

Pachter’s Kallisto Comes with Unconventional License

We tested the latest and greatest RNAseq expression analysis program Kallisto, discussed in this link. It comes with binaries (for Mac and Ubuntu) and source code.

Running the binary

First we tested the binary with the small data file provided in the test folder, and it worked without trouble, as expected. We will test it on larger files next, but do not anticipate any problems there either.

Remember, this is an alignment-free method for RNAseq expression analysis. Therefore, if you diligently aligned all your reads for your expression analysis, you may delete those large SAM/BAM files and start anew. Running this program will probably take less time than deleting the large SAM files from your hard drive – no kidding!
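
For reference, here is roughly how such a run can be scripted from Python. The file names are placeholders, and the flags reflect the kallisto command line as we understand it (an index step followed by a quant step), so check the program's own help output before copying.

```python
# Minimal sketch of driving the kallisto binary from Python; file names are
# placeholders, and the subcommands/flags should be verified against your
# installed version of kallisto.
import subprocess

# 1. Build a transcriptome index (done once per annotation).
subprocess.run(["kallisto", "index", "-i", "transcripts.idx", "transcripts.fasta"],
               check=True)

# 2. Quantify a paired-end library against that index.
subprocess.run(["kallisto", "quant",
                "-i", "transcripts.idx",
                "-o", "quant_out",
                "reads_1.fastq.gz", "reads_2.fastq.gz"],
               check=True)

# quant_out/ then contains the abundance estimates (e.g. abundance.tsv).
```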

Compiling the source

Next, we tried to compile the code downloaded from GitHub. That hit a glitch (“CMake 2.8.12 or higher is required. You are running version 2.8.9”). We tried changing CMakeLists.txt to “cmake_minimum_required(VERSION 2.8.9)”, but that gave the error ‘Unknown CMake command “add_compile_options”’. It seems we need to upgrade CMake to compile the source code.

Algorithm and code

The algorithm is explained in the arXiv paper by Bray et al. It improves on Sailfish: instead of shredding the reads into independent k-mers, it uses a de Bruijn graph of the transcriptome to perform a pseudoalignment of each read against the indexed reference.
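
Here is a toy sketch of the pseudoalignment idea in Python: index the k-mers of each transcript, then intersect the transcript sets hit by a read's k-mers. The real implementation works on a transcriptome de Bruijn graph with many optimizations, so treat this purely as a concept demo with made-up sequences.

```python
# Toy sketch of pseudoalignment: a read is assigned to the set of transcripts
# consistent with every one of its k-mers. The real kallisto walks a transcriptome
# de Bruijn graph and skips redundant k-mers; this only conveys the idea.
from collections import defaultdict

K = 7  # tiny k for the toy example; the real program uses a much larger k

transcripts = {
    "tx1": "ACGTACGTGGCATTAGCCGTA",
    "tx2": "ACGTACGTGGCATTTTTTTTT",
    "tx3": "GGGGGGGGGGCATTAGCCGTA",
}

# Index: k-mer -> set of transcripts containing it
index = defaultdict(set)
for name, seq in transcripts.items():
    for i in range(len(seq) - K + 1):
        index[seq[i:i + K]].add(name)

def pseudoalign(read):
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - K + 1):
        hits = index.get(read[i:i + K], set())
        compatible = hits if compatible is None else compatible & hits
        if not compatible:           # early exit: no transcript explains the read
            return set()
    return compatible or set()

print(pseudoalign("ACGTACGTGGCATT"))   # -> {'tx1', 'tx2'}: an equivalence class of transcripts
print(pseudoalign("GCATTAGCCGTA"))     # -> {'tx1', 'tx3'}
```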

The code (~3,500 lines of C++, ~5,500 lines including headers) is fairly simple to understand for those who work on bioinformatics programs. It leverages Heng Li’s super-efficient kseq.h along with SuperFastHash, and builds the remaining blocks according to the algorithm described above.

License

The license is very unusual and makes this program useless to build on. It works against sharing and reuse, because borrowing from their code will taint the license of your own code base. Be very worried!

Let me explain. Most bioinformatics programs come with a GPL or MIT/Berkeley-type license, but this one adds ‘for educational and research not-for-profit purposes’, making the license very restrictive and legally ambiguous. Say you add a file from it to your bioinformatics code released under the GPL. You would then be forced to add the same restrictive words to your GPL license, which would spoil the licensing of your GPL code. Then there is the propagation effect: if someone else borrows from your code, that person’s code will need to carry the same extra words, and so on. Similarly, you cannot use their code in a cloud application along with your data, because you cannot guarantee that viewers of your data do not come from a company.

The entire license is reproduced below with emphasis added on the extra (and legally ambiguous) part.

Copyright ©2015. The Regents of the University of California (Regents). All Rights Reserved. Permission to use, copy, modify, and distribute this software and its documentation for educational and research not-for-profit purposes, without fee and without a signed licensing agreement, is hereby granted, provided that the above copyright notice, this paragraph and the following two paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, for commercial licensing opportunities.

Created by Nicolas Bray, Harold Pimentel, Pall Melsted and Lior Pachter, University of California, Berkeley

IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED “AS IS”. REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

Summary

Following in the footsteps of Sailfish, Kallisto is another excellent program that completely changes how RNAseq expression analysis gets done. These tools give us the ability to iterate over different transcriptome assemblies and annotation sets, which is very important for researchers working on non-model organisms.

The license, however, is very restrictive and not helpful at all for those building on the code. The code base should be avoided at all costs.

Suffix Array in External Memory – Latest from Felipe Louza

Constructing suffix arrays in external memory is very useful, and we have covered the topic in various blog posts (see here, here, here and here). There are also BEETL and RopeBWT for constructing the BWT of very large read libraries in external memory.

Felipe Louza, a Brazilian researcher working in the field and a frequent commenter on previous blog posts, has a new update that our readers will find informative.

I have committed to GitHub a new version of a tool to construct generalized enhanced suffix arrays.

link: https://github.com/felipelouza/egsa

The only related work (in mode 2) I know of is https://github.com/BEETL/BEETL, but it is limited to indexing strings of the same length.

Enjoy !

Bioinformatics and Docker

Since it is our lightweight algorithm day, we thought it would be worthwhile to mention several relevant posts on Docker.

1. What is Docker?

Docker is a lightweight way to make code portable. It lets you containerize a code module tested on one operating system, instead of running a separate VMware-style virtual machine for every new module.

The following slides introduce you to Docker. Especially pay attention to the ‘matrix from hell’ slide.

[slideshare id=27669368&doc=introdockeroctober13-131028203253-phpapp02]

2. Docker in Bioinformatics

Docker is being used to containerize bioinformatics tools, and there is already a BioDocker project.

A few days back, redditors asked – does Docker hurt the performance of bioinformatics algorithms?

The following blog post from Heng Li would be informative in that respect.

A few hours with docker

Preliminary thoughts

Docker is a bless to complex systems such as the old Apache+MySQL+PHP combo, but is a curse to simple command line tools. For simple tools, it adds multiple complications (security, kernel version, Dockerfile, large package, inter-process communication, etc) with little benefit.

Bioinformatics tools are not rocket science. They are supposed to be simple. If they are not simple, we should encourage better practices rather than live with the problems and resort to docker. I am particularly against dockerizing easy-to-compile tools such as velvet and bwa or well packaged tools such as spades. Another large fraction of tools in C/C++ can be compiled to statically linked binaries or shipped with necessary dynamic libraries (see sailfish). While not ideal, these are still better solutions than docker. Docker will be needed for some tools with complex dependencies, but I predict most of such tools will be abandoned by users unless they are substantially better than other competitors, which rarely happens in practice.

PS: the only benefit of dockerizing simple tools is that we can acquire a tool with docker pull user/tool, but that is really the benefit of a centralized repository which we are lacking in our field.

Our view is in agreement with Heng Li’s last paragraph (“PS”), but Docker will likely have a different use in the bioinformatics world. More on that in a later blog post.

—————————-

Edit.

Adding a few other useful links –

1. Dockerized bioinformatics programs at biostar

2. Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker

We developed a freely available, easy to run implementation of bcbio-nextgen on Amazon Web Services (AWS) using Docker. bcbio is a community developed tool providing validated and scalable variant calling and RNA-seq analysis. The AWS implementation automates all of the steps of building a cluster, attaching high performance shared filesystems, and running an analysis. This makes bcbio readily available to the research community without the need to install and configure a local installation.

The entire installation bootstraps from standard Linux AMIs, enabling adjustment of the tools, genome data and installation without needing to prepare custom AMIs. The implementation uses Elasticluster to provision and configure the cluster. We automate the process with the boto Python interface to AWS and Ansible scripts. bcbio-vm isolates code and tools inside a Docker container allowing runs on any remote machine with a download of the Docker image and access to the shared filesystem. Analyses run directly from S3 buckets, with automatic streaming download of input data and upload of final processed data.

We provide timing benchmarks for running a full variant calling analysis using bcbio on AWS. The benchmark dataset was a cancer tumor/normal evaluation, from the ICGC-TCGA DREAM challenge, with 100x coverage in exome regions. We compared the results of running this dataset on 2 different networked filesystems: Lustre and NFS. We also show benchmarks for an RNA-seq dataset using inputs from the Sequencing Quality Control (SEQC) project.

3. Docker guide at basespace

Lightweight Algorithms for RNAseq Expression Analysis – Sailfish, Kallisto and Salmon

The world of RNAseq expression analysis has seen a number of exciting developments. We have been covering them on our RNAseq blog, but thought a summary on this bioinformatics blog would be helpful to our readers.

The Problem

RSEM, the commonly used program for RNAseq expression analysis, runs very slowly. We know of one postdoc who started her RNAseq analysis just before the birth of her son Samuel. By the time that RSEM run finished, Samuel had finished his PhD and started his own postdoc project to run, guess what, RSEM!! Another person hand-counted (with help from beads and an abacus) his entire RNAseq library of fifty HiSeq files and still finished before RSEM.

Jokes aside, RSEM is slow and, more importantly, makes researchers dependent on core facilities, cloud vendors, multi-core servers and the most expensive resource of all – graduate students :). Thankfully, the liberating alternatives are here. As Lior Pachter put it on his blog –

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students– for the first time in years I’m analyzing tons of data on my own; because of the simplicity and speed I find I have the time for it. Enjoy!

Sailfish

Sailfish is included in Pandora’s Toolbox for Bioinformatics as a very efficient (‘lightweight’) program for expression analysis. This code, developed by Rob Patro and collaborators, truly saved our lives, and we wrote about the paper and code several times, including in ‘Goodbye RSEM, Sailfish Paper Published’.

In terms of algorithm, Sailfish shreds the reads into k-mers and counts those k-mers against the reference transcriptome, which it stores as a perfect hash.
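
As a rough sketch of that approach (made-up sequences, with an ordinary Python dict standing in for the perfect hash): the reads are shredded into k-mers and the k-mers are counted against the transcript index, with no record of which read each k-mer came from – exactly the information that pseudoalignment later recovers.

```python
# Toy sketch of the Sailfish-style idea: shred reads into k-mers and count them
# against a transcript k-mer index. A plain dict stands in for the perfect hash;
# sequences are made up, and the real program is far more sophisticated.
from collections import Counter, defaultdict

K = 7
transcripts = {
    "tx1": "ACGTACGTGGCATTAGCCGTA",
    "tx2": "ACGTACGTGGCATTTTTTTTT",
}

kmer_to_tx = defaultdict(set)
for name, seq in transcripts.items():
    for i in range(len(seq) - K + 1):
        kmer_to_tx[seq[i:i + K]].add(name)

reads = ["ACGTACGTGGCATT", "GCATTAGCCGTAAC"]

# Count k-mer hits per transcript; note that read identity is discarded,
# so k-mers from one read may be credited to different transcripts.
hits = Counter()
for read in reads:
    for i in range(len(read) - K + 1):
        for tx in kmer_to_tx.get(read[i:i + K], ()):
            hits[tx] += 1

print(hits)   # raw k-mer counts per transcript, which the estimation step then works from
```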

Kallisto

Kallisto, developed by the Pachter lab, uses a similar lightweight approach and improves on Sailfish. Páll Melsted, whose bioinformatics research has been featured on our blog many times, played a big role in developing the code. Pachter’s blog post describes Kallisto –

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However the fact that simpler was so much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of k-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. That is asking for a lot less, and as it turns out, pseudoalignment can be much faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient k-mer counting using the bloom filter and de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

Salmon

In the meantime, Rob Patro, the developer of Sailfish, has not been sitting still and has been busy doing a lot of ‘fishy’ things. He has brought another fish, Salmon, to the RNAseq shop. Here is his latest blog post –

Not-quite alignments: Salmon, Kallisto and Efficient Quantification of RNA-Seq data

Kallisto
Nick Bray, Harold Pimentel, Páll Melsted and Lior Pachter introduced a new method and tool, called kallisto for fast and accurate quantification of transcript-level abundance from RNA-seq data. I had a chance to drop by the poster session and speak with the first 3 authors, and also (in the last hour) to read the pre-print of the paper. First, I should say that I think the idea behind kallisto and the implementation of the software are excellent (Lior’s group has a history of developing very cool methods and backing these up with well-written, well-maintained software). Basically, they demonstrate that you can obtain the speed of a tool like Sailfish (which quantifies abundance based on k-mers and k-mer counts) without “shredding” the reads into k-mers, which can sometimes discard useful information contained therein. Along with an efficient implementation and (from what I can see so far, some very nice C++11 source code), these results are obtained mainly through the concept of pseudoalignments.

Essentially, pseudoalignments define a relationship between a read and a set of compatible transcripts (this relationship is computed based on “mapping” the k-mers to paths in a transcript De Bruijn graph). The pseudoalignment here gives us more information than the set of individual k-mers, since the k-mers in a read remain “coupled” when the read is assigned to a transcript equivalence class. As the pseudoalignments are generated, equivalence classes are computed and maintained — similar to the concept of equivalence classes among reads as introduced in the IsoEM paper or to equivalence classes among k-mers as used in Sailfish.

The results of kallisto are very impressive, and the idea of pseudoalignments, is, I think, a very good one (more on this below). Essentially, we are converging upon an answer to the question, “what information is really needed to obtain accurate quantification estimates?”, and trying to develop methods that use this information to compute answers as efficiently as possible (but no more efficiently ;P).

Salmon

One of the reasons kallisto and it’s ideas are so close to my heart is that, after the publication of Sailfish, Carl and I didn’t stop thinking about where the Sailfish model might fall short and how it might be improved. In a somewhat awesome twist of fate, while the kallisto team was in some sense “inspired” (or at least provoked to deep thought) by the approach of Sailfish, we were also inspired by some of the ideas introduced in Lior’s (and his previous student Adam Roberts’) prior RNA-seq quantification tool, eXpress.

Pandora’s Toolbox for Bioinformatics

In Pandora’s Toolbox for Bioinformatics, we already include the Salmon version of Sailfish. As soon as I get time, I will add the Kallisto code and algorithm to the free book and the corresponding GitHub repository. The main goal is to create a single download point for a small number of important and efficient bioinformatics programs so that a user can get started quickly.

Stay away from digital normalization

If you are a bioinformatics technician who runs digital normalization on every data set without thinking much, the following blog post from Pachter and the related discussion will be useful.

Digital normalization revealed

It turns out that Tic Tacs are in fact almost pure sugar. Its easy to figure this out by looking at the number of calories per serving (1.9) and multiplying the number of calories per gram of sugar (3.8) by 0.49 => 1.862 calories. 98% sugar! Ferrero basically admits this in their FAQ. Acting completely within the bounds of the law, they have simply exploited an arbitrary threshold of the FDA. Arbitrary thresholds are always problematic; not only can they have unintended consequences, but they can be manipulated to engineer desired outcomes. In computational biology they have become ubiquitous, frequently being described as “filters” or “pre-processing steps”. Regardless of how they are justified, thresholds are thresholds are thresholds. They can sometimes be beneficial, but they are dangerous when wielded indiscriminately.

There is one type of thresholding/filtering in used in RNA-Seq that my postdoc Bo Li and I have been thinking about a bit this year. It consists of removing duplicate reads, i.e. reads that map to the same position in a transcriptome. The motivation behind such filtering is to reduce or eliminate amplification bias, and it is based on the intuition that it is unlikely that lightning strikes the same spot multiple times. That is, it is improbable that many reads would map to the exact same location assuming a model for sequencing that posits selecting fragments from transcripts uniformly. The idea is also called de-duplication or digital normalization.

Digital normalization is obviously problematic for high abundance transcripts. Consider, for example, a transcripts that is so abundant that it is extremely likely that at least one read will start at every site (ignoring the ends, which for the purposes of the thought experiment are not relevant). This would also be the case if the transcript was twice as abundant, and so digital normalization would prevent the possibility for estimating the difference. This issue was noted in a paper published earlier this year by Zhou et al. The authors investigate in some detail the implications of this problem, and quantify the bias it introduces in a number of data sets. But a key question not answered in the paper is what does digital normalization actually do?
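
To make the thought experiment concrete, here is a small simulation with made-up numbers: once a transcript is abundant enough that nearly every start position is covered, de-duplication by mapping position caps the count and erases further fold-changes.

```python
# Toy sketch of the thought experiment above: for a highly expressed transcript,
# de-duplicating reads by mapping position caps the count at (roughly) the number
# of possible start sites, so doubling true abundance no longer changes the signal.
# All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
transcript_length, read_length = 1000, 100
n_start_sites = transcript_length - read_length + 1   # 901 possible start positions

def dedup_count(n_reads):
    """Sample read start positions uniformly, then count unique positions."""
    starts = rng.integers(0, n_start_sites, size=n_reads)
    return len(np.unique(starts))

for n_reads in (500, 5_000, 50_000, 100_000):          # 100k is "twice as abundant" as 50k
    print(n_reads, "raw reads ->", dedup_count(n_reads), "after de-duplication")

# The de-duplicated count saturates near 901 unique start sites, so the 2x
# difference between 50,000 and 100,000 reads is erased by the filter.
```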

Using Humans as a Model Organism – Sydney Brenner

Here are some thought-provoking ideas from an old lecture of Brenner (emphasis ours).

Brenner, who was one of the 2002 Nobel Prize winners in Physiology or Medicine, helped discover messenger RNA. He also pioneered the use of the soil roundworm Caenorhabditis elegans as a model organism, opening the door for new insights into developmental biology, aging, and programmed cell death.

But these days he’s pushing a new model organism: humans. “We don’t have to look for a model organism anymore,” Brenner said. “Because we are the model organisms.”

As researchers unravel the genome, it’s easier than ever to evaluate human biology directly, rather than extrapolating it from research on other animals, he said. Human research happens all the time in society — in families and communities to governments and religions, Brenner mused, “Why don’t we now use this to try to understand our genomes and ourselves?”

He acknowledged that there are still challenges to interpreting genetic information. But Brenner argued that the extensive variation between individuals could hold a wealth of information. “It is the variation that has become the interesting thing to study,” he said.

Even so, completely analyzing the genetics of tens of thousands of humans remains technically impractical — and prohibitively expensive. Even as sequencing becomes cheaper, Brenner noted, interpreting the data will likely remain challenging.

“What we need, actually, is a view of all this that tests hypotheses all the time,” Brenner argued. This includes studying “human mutants” — something that may not be as difficult as it sounds given that, “We’re all mutants, basically. It’s hard to find a wild type.”

You can read the rest in this link – ‘Sydney Brenner Urges Cancer Researchers to Consider ‘Bedside to Bench’ Approach‘.

One thing is for sure: Brenner was not advocating collecting tons of genomic data, ENCODE-style. In fact, he cannot stand any of the ‘omics’ sciences. Here are some quotes from his other speeches and comments.

The title of his talk was “The Next 100 Years of Biology,” but Brenner, whose scientific triumphs include establishing the existence of messenger RNA, shied away from speculation. Instead, he asked, “What should we do over the next 100 years?”

“I think a lot of (biology) is going in absolutely the wrong direction,” he said.

The Human Genome Project, for example, has led to what Brenner called “factory science” — heavy investment in expensive gene sequencers that begin to drive the direction of research.

“You have 100 machines; you’re looking at about $100 million of investment,” he said. “And you’ve just got to keep that going all the time in order to get the use from them.”

This, he said, and the “Genburo, the Politburo of Genetics, in which everything is decided … stultifies” research, and discourages young people from entering the field.

Brenner contends that the organizing principle for thinking about the genome can be found in the cell, the basic unit of life. In an essay he published in the January 12, 2010, issue of Philosophical Transactions of the Royal Society B, Brenner outlined a project called CellMap, which would catalogue every type of cell in the body and detail how different genetic regions (not genes) behave in each cellular environment. He compared it to a city map that identifies each house, the people who live inside it, and the interactions within and between the houses. “I think we should be doing genetics, not genomics,” says Brenner. “When you do genetics, you are focusing on function. When you do genomics, these are just letters and numbers. Nobody bothers about the connections.”

We earlier talked about the alternative proposed in his Philosophical Transactions of the Royal Society B article “Sequences and Consequences” here.

Seventy Years Since the End of WWII

[Image: World War II casualties]

Top Ten Genomes – (x) The Human Genome


The human genome gets the top position for two reasons.

(i) For being the most unusually usual. Every attempt to show ‘our genome’ to be exceptional has failed so far.

(ii) For attracting the largest number of mountebanks.

Take the number of coding genes, for example. When the human genome was being sequenced, Ewan Birney ran a gene sweepstakes to collect predictions for the number of genes, and the estimates made by scientists ranged from 30K to 150K, with the higher side being more popular. “The fruit fly has 15K genes and we are a lot more exceptional than fruit flies” was the sort of argument.

It all began in a bar. Three years ago, as the DNA sequence of the human genome was nearing completion, biologists’ estimates of the number of its genes ranged from 28,000 to 140,000. At the bar at the Cold Spring Harbor Laboratory one evening, Dr. Ewan Birney, had the idea of opening a sweepstakes. He invited researchers to register their best estimates of the number of genes, with the winner — with the guess closest to the final number — to be announced this year. Bets cost $1 in 2000, $5 in 2001 and $20 since last year.

The final count came to no more than ~20,000, which is roughly the same number, plus or minus a few thousand, as for almost all vertebrates. Yet nobody wanted to believe it, and many still do not. From the same 2003 NY Times article –

“The gene count will certainly go up from the 30,000 that people currently claim,” Dr. Snyder said in an interview. “The message out there is that there is clearly a lot more coding information.” Pressed for an estimate, he replied, “I’ll guess total genes — over 40,000.”

Snyder has spent his whole career trying to reach that 100,000 number with a series of failed arguments (the latest being alternative splicing). He even went to the extent of sequencing his own genome (the ‘Snyderome’), possibly believing that an exceptional person like him would have 30% more genes than other earthlings.

———————————-

That brings us to the mountebanks. There are just too many associated with the human genome – Birney, Snyder, Collins, Cole, Fredrickson – to name a few. That is because only the kind of money associated with the human genome can support so many of them.

Francis Collins retracted five papers in 1996 and blamed it all on his co-author. He claimed that, as the senior author of a two-author paper, he had no clue what was going on!! Had the discoveries qualified for a Nobel prize instead, Collins would have been first in line to claim credit. Instead of being banned from science for this ‘heads I win, tails my junior colleague loses’ feat, Collins was rewarded with higher and higher positions in the human genome hierarchy.

Snyder published twice as many papers as Eric Davidson over the last decade, but with hardly any discovery to report. In the meantime, Davidson reinvented developmental biology by working on the sea urchin genome and wrote the most profound book that will change our understanding of evolution. Speaking of Snyder, the other day I read on Twitter that he and his student ‘invented RNAseq’, as the student claimed in a talk!!! Only Snyder and people associated with Snyder can pull off such chutzpah. What is there to invent? I had been hearing about measuring gene expression with high-throughput sequencing from Eric Davidson ever since Mortazavi and Wold started their ChIPseq experiments at Caltech.

Ewan Birney was profiled by Science in 2012 as ‘Genomics’ Big Talker’. He wrote an article explaining how big science had been successful in changing biology, for example by ‘writing the eulogy for junk DNA’. We know how that fiction ended.

On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE

However, it is noteworthy how Birney responded to criticism of his supposedly textbook-changing scientific discovery. The Sandwalk blog reported in 2012 –

Birney has a blog [Response on ENCODE reaction] but he has not responded to questions. Check out Ryan Gregory’s post to see what Birney is avoiding: Comments on Birney’s blog. Pay particular attention to the questions asked by Diogenes. Let’s hope that the reason for Birney’s silence is because he’s preparing a lengthy and scientifically accurate response!

I wonder if Science is going to publish anything else on this fiasco? Most of the other journals have at least acknowledged that there’s a problem with the ENCODE publicity campaign. Some have even defended junk DNA and emphasized the misleading statements published by Birney et al. So far, there’s nothing on the Science website in spite of the fact that Science published one of the worst interpretations of the ENCODE results…. Or maybe it’s BECAUSE it published such a biased account that we’re not seeing any follow-up.

Not answering questions about scientific theories is so 2012. These days, Birney outright bans others from asking him critical questions.


Cole and Fredrickson (the positivity lady) are crooks of a different dimension, drawing connections between the human genome and people’s happiness or sadness. After we showed that their paper was complete nonsense, they went to the NCBI GEO database and surreptitiously uploaded a new file with modified raw data!! Needless to say, that invalidated their original paper, but who can fight unlimited NIH funding for junk science?

Horror !! NIH is Now Funding Loving Kindness Meditation of Positivity Lady through Multi-year Grant