New Book: RNA-seq Data Analysis: A Practical Approach

This Article is Sponsored by CRC Press

A new book on RNA-seq data analysis has come out, and it is written by none other than our long-time reader Mikael Huss, along with four other authors. You can get the book at the publisher’s website through the following link.

RNA-seq Data Analysis: A Practical Approach (By: Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, Garry Wong) – ISBN 9781466595002

Before you do that, please feel free to enjoy the first chapter embedded below.

Nanopore – Truth and Nothing but the Truth

At last we get the analysis of Oxford Nanopore data that we have been looking for since day one. Michael Schatz posted the GI2014 slides of James Gurtowski, from his lab, on his website.

Read Length Distribution

Read length goes all the way to ~147Kb, but the longer reads are problematic (see next section).

Mean read length is 5.4Kb.

Aligned Reads

Only 32% of the reads match anywhere in the genome. Long reads have a high failure rate. Here is the relevant slide.


Also note that the reads that do match may align only partially (~50%). Check slide 10.

Error rate

For reads that match, average accuracy is 64%. That number increases to 70% for the ‘2D base calling’ reads. These are much better than the initial report of Mikheyev/Tin and that is definitely good news. Moreover, another poster presentation at the conference reported similar numbers.

Distribution of errors – 57% Mismatches, 32% Deletions, 11% Insertions.

Please note that the distribution is quite different from PacBio, which is mostly dominated by ~70% insertions.

Implication for Assembly and Other Work

1. One thing not clear from the slides is whether the errors are distributed randomly within reads or are biased. This is important for the ‘perfect assembly math’ presented by Gene Myers in his AGBT talk.

2. What is the probability that a random base from a 40Kb read is actually from the genome? We know that a read has a 32% chance of matching the genome and that, within a matching read, 50% of the bases align on average. Also, the accuracy within the aligned portion is 70%. So, treating these three factors as independent, the probability is 0.32*0.5*0.7 ≈ 0.11, or about 11%.

The middle number will be much better for shorter reads and worse for the long reads.

3. With 32% matching rate and 65-70% accuracy for matching regions, it will be impossible to do assembly from Nanopore reads alone. This will have implications for doing field work. The assemblies in the presentation were done by combining Nanopore and Illumina reads.
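The back-of-the-envelope estimate in point 2 can be written out as a short Python sketch. The three input figures are the ones quoted above from the slides; the function name and the assumption that the three factors are independent are ours, not something the slides state.

```python
# Figures quoted in the post from the GI2014 slides.
P_READ_MATCHES = 0.32      # fraction of reads that match anywhere in the genome
FRACTION_ALIGNED = 0.50    # average fraction of a matching read that aligns
ALIGNED_ACCURACY = 0.70    # accuracy within aligned regions (2D base calling)


def p_base_from_genome(p_match, frac_aligned, accuracy):
    """Chance that a random base sits in a matching read, inside the
    aligned part, and is called correctly.

    Assumption (ours): the three factors are independent, so they multiply.
    """
    return p_match * frac_aligned * accuracy


estimate = p_base_from_genome(P_READ_MATCHES, FRACTION_ALIGNED, ALIGNED_ACCURACY)
print(f"{estimate:.3f}")  # about 0.112, i.e. roughly 11%
```

Since the aligned fraction depends on read length, the same function can be called with a different middle value to see how the estimate moves for shorter or longer reads.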

An important point in this respect –



A week ago, we mentioned Nick Loman’s release of E. coli data on the GigaScience website. Readers will find the following arXiv paper helpful (h/t: @lexnederbragt).

Background: The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. By measuring the change in current produced when DNA strands translocate through and interact with a charged protein nanopore the device is able to deduce the underlying nucleotide sequence. Findings: We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION device during the early-access MinION Access Program (MAP). Two sequencing runs of the MinION were performed using the R7 chemistry (released July 2014) and one using R7.3 (released September 2014). Conclusions: Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods. Datasets are provided through the GigaDB database. Keywords: genomics; nanopore sequencing

MEGAHIT: An Ultra-fast Single-node Solution for Large and Complex Metagenomics Assembly via Succinct de Bruijn Graph

About two years ago, we reported on succinct de Bruijn graph construction by Alex Bowe and collaborators. Also, earlier this year, the HKU group of Professor Tak-Wah Lam published their implementation of GPU-Accelerated BWT Construction for Large Collection of Short Reads. Now the two are combined, along with ideas from IDBA-UD, into a metagenome assembler. The paper is available from arXiv.


MEGAHIT is a NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252Gbps in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it avoids pre-processing like partitioning and normalization, which might compromise on result integrity. MEGAHIT generates 3 times larger assembly, with longer contig N50 and average contig length than the previous assembly. 55.8% of the reads were aligned to the assembly, which is 4 times higher than the previous. The source code of MEGAHIT is freely available at this https URL under GPLv3 license.

MUSiCC for Metagenomes and Lior Pachter’s Code for Transcriptomes

Lior Pachter is having a discussion on the mathematical aspects of digital normalization (Digital normalization revealed). This is related to his work with David Tse’s group, and we posted the relevant tutorial slides here. Pachter and his student Sreeram are developing code to analyze de novo assembled RNA-seq data.

For readers interested in a similar mathematical analysis for metagenomes, the following paper may be useful.


Functional metagenomic analyses commonly involve a normalization step, where measured levels of genes or pathways are converted into relative abundances. Here, we demonstrate that this normalization scheme introduces marked biases both across and within human microbiome samples and systematically identify various sample- and gene-specific properties that contribute to these biases. We introduce an alternative normalization paradigm, MUSiCC, which combines universal single-copy genes with machine learning methods to correct these biases and to obtain a more accurate and biologically meaningful measure of gene abundances. Finally, we demonstrate that MUSiCC significantly improves downstream discovery of functional shifts in the microbiome. MUSiCC is available at
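To make the contrast in the abstract concrete, here is a minimal Python sketch of the two normalization ideas: converting gene levels to relative abundances, versus scaling by the typical level of universal single-copy genes. This is not the MUSiCC implementation. The function names, gene identifiers and counts are hypothetical, the choice of the median as the scaling statistic is our assumption, and the machine-learning correction mentioned in the abstract is not reproduced.

```python
from statistics import median


def relative_abundance(counts):
    """Standard scheme: divide each gene's level by the sample total."""
    total = sum(counts.values())
    return {gene: value / total for gene, value in counts.items()}


def single_copy_normalize(counts, single_copy_genes):
    """Sketch of the alternative: divide by the median level of universal
    single-copy genes, so values read as approximate copies per genome.
    Using the median is our assumption for this illustration.
    """
    scale = median(counts[gene] for gene in single_copy_genes)
    return {gene: value / scale for gene, value in counts.items()}


# Hypothetical toy sample (identifiers and numbers invented for illustration).
sample = {"uscg_a": 10.0, "uscg_b": 12.0, "uscg_c": 8.0,
          "gene_x": 30.0, "gene_y": 40.0}
single_copy = ["uscg_a", "uscg_b", "uscg_c"]

rel = relative_abundance(sample)
per_genome = single_copy_normalize(sample, single_copy)
print(rel["gene_x"], per_genome["gene_x"])
```

In the relative-abundance scheme every value depends on everything else in the sample, because the total is the denominator. In the single-copy scheme the denominator depends only on the reference genes, which is the property the abstract argues for.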

How Much of Complex Diseases are Controlled by the Genome? (Ken Weiss)

I have been reading the book Genetics and the Logic of Evolution by Ken Weiss and Anne Buchanan, and found it thought-provoking. Impatient readers are encouraged to first take a look at this 2005 review paper, which presents their ideas in condensed form –

The Phenogenetic Logic of Life

Our regular readers are already familiar with their excellent blog – ‘The Mermaid’s Tale’ named after their later book, which covers topics similar to ‘Genetics and the Logic of Evolution’.

Today’s blog post is about a Penn State presentation in which Professor Weiss discussed the extent to which complex diseases are ‘controlled’ by our genomes. Before listening to the entire talk, please fast-forward to 46:00, where he shows the roles of genes and environment. The following figure presents the risk of breast cancer by age group for those born in the USA before and after the 1940s. If the genomes did not change, isn’t the dramatic rise in risk entirely due to environment? He presented a test case from Finland that suggests so.


You can listen to the full talk here –

Genome Informatics 2014 Conference (#gi2014)

The conference is taking place from 21 to 24 September. On Twitter, follow the #gi2014 hashtag to see comments from attendees.

Program -

Scientific Programme Committee:
Jennifer Harrow Wellcome Trust Sanger Institute, UK
Daniel MacArthur Massachusetts General Hospital, USA
Michael Schatz Cold Spring Harbor Laboratory, USA

Keynote Speakers:
Roderic Guigo, CRG, Spain
Serafim Batzoglou, Stanford University School of Medicine, USA

Invited Speaker & Sessions Topics:

Personal and Medical Genomics
Atul Butte, Stanford University School of Medicine, USA
Elizabeth Worthey, Medical College of Wisconsin, USA

Epigenomics and Non-Coding Genome
Rafael Irizarry, Harvard School Of Public Health, USA
Zhiping Weng, University of Massachusetts Medical School, USA

Databases, Data Mining, Visualization, Ontologies and Curation
Peter Robinson, Charité-Universitätsmedizin Berlin, Germany
Jennifer Wortman, Broad Institute, USA

Sequencing Pipelines and Assembly
Zamin Iqbal, University of Oxford, UK
Jared Simpson, Ontario Institute for Cancer Research, Canada

Comparative, Evolutionary and Metagenomics
Janet Kelso, MPI-EVA, Germany
Nick Loman, Birmingham University, UK

Transcriptomics, Alternative Splicing and Gene Predictions
Tuuli Lappalainen, New York Genome Center, USA
John Marioni, EMBL-EBI, UK



The current talk is from Samarth Rangavittal on #Sequencing #Gorilla Y Chromosome using PacBio. Follow the discussion in real time on Twitter. We will add any interesting comments we find below.


On Nanopore, Nick Loman said –



Jared Simpson on assembly-based variant calling –


Antarctic Sea Ice Reaches Highest Levels Since Records Began

ABC News reports

Scientists say the extent of Antarctic sea ice cover is at its highest level since records began.

Satellite imagery reveals an area of about 20 million square kilometres covered by sea ice around the Antarctic continent.

Jan Lieser from the Antarctic Climate and Ecosystems Cooperative Research Centre (CRC) said the discovery was made two days ago.

“This is an area covered by sea ice which we’ve never seen from space before,” he said.

“Thirty-five years ago the first satellites went up which were reliably telling us what area, two dimensional area, of sea ice was covered and we’ve never seen that before, that much area.

Zerohedge adds more color and a satellite image-

In what appears to be an awkward moment of uncomfortable fact, ABC reports satellite imagery reveals an area of about 20 million square kilometres covered by sea ice around the Antarctic continent – the highest level of coverage since records began. This is the 3rd year in a row that the sea ice coverage has reached a record level – increasing at 1.5% each decade since 1979. However, there is another side to this, as the area covered in sea ice expands scientists have said the ice on the continent of Antarctica which is not over the ocean continues to deplete. The climate is changing, one way or the other.


Not only in Antarctica: globally, temperatures have not risen from where they were 15 years ago, in what the Acting-man blog calls The “Inconvenient Pause”.

Given that several years ago, climate scientists admitted that their models would have to be thrown out if the “pause” lasted longer than 15 years, it may be a bit surprising to find that not only have they not been thrown out, but the scaremongering in the media generally continues to be as shrill as ever, if not more so.

In the meantime, about 40 different explanations for the pause have been proposed in scientific papers. Obviously, not all 40 can be correct, but a thread that is unifying most of them has emerged: the pause is the result of, well, nature. Taking a layman’s stab at this, we would propose that the same is true of the most recent warming period and the 30 years of cooling that preceded it. In other words, human influence on the climate is probably so small as to be negligible, and there is no cause for alarm.

So why the continued alarmism? Simple: it sells. Newspapers for instance aren’t sold with stories about there being no reason to be alarmed about something or other. Alarmist scientists face some risk to their reputation, but more importantly, a huge gravy train could be derailed if they admit that they have been wrong all along. As we have previously pointed out, before the current climate alarmism really got going in the 1980s, climate science was a rather small, sparsely funded scientific niche. Along with all the other government-subsidized activities that have sprung from climate alarmism (think e.g. “green, but economically non-viable energy”) it has in the meantime grown into a quite sizable tax-payer funded racket.


Bob Hoye – Tyrannical Duncery (continued)

Global Warming Nonsense

With a degree in Geophysics this writer has found the story about man-caused global warming fascinating. Fascinating in its blatant abuse of the discipline and skepticism that has driven the advance of science for thousands of years. Even more fascinating is that since catastrophic warming displaced global cooling concerns of the 1970s it has become another great experiment in authoritarian science.
The last such example involved a bitter struggle between the establishment and the giants in science Copernicus, Galileo and Kepler. The establishment spent considerable effort to incorporate growing astronomical evidence into their Earth-centric model. They imagined crystal spheres within spheres with absolutely impossible planetary motions. Officialdom promoted absurdity. Research contrary to authoritarian dogma was not permitted.
A prosperous middle class devoured books and pamphlets that criticized the establishment’s dogma.

Something similar prevails today and today’s “deniers” should be relieved that the full censure of the movement does not include the threat of capital punishment. In which case, the continued assembly of climate data with reason and logic independent of state propaganda will win the argument.

Can the issues be described as state propaganda?

The remedy for every intense concern about global warming and climate change is an increase in taxation and regulation. By the state, of the state and for the state.
In January 2008 this writer published “Intellectual Hysteria” which reviewed the promoted ‘political science’ as well as actual science. The main points were that the CAGW movement promoted that only man-generated CO2 was the cause of global warming. It was not the first time that mankind was deemed evil. Likely it won’t be the last.
The science part was a review of climate changes based upon orbital mechanics of the solar system. Earth scientists not on the government band wagon appreciate how climate history really works.
The title “Intellectual Hysteria” was based upon the long history of charismatic intellectuals discovering a personal revelation and promoting it. Usually part of the personal concern included that there were too many people on the planet. In the politically turbulent 1860s, Stanley Jevons calculated that the supplies of coal would soon be exhausted. His book was called The Coal Question. In the politically turbulent 1790s, Thomas Malthus promoted his personal anxieties under the rubric later described as “Malthusian Catastrophe”.

Cross-post: The prestige of “Jewish Genome” papers

When we covered Eran Elhaik’s paper on the origin of the European Jewish population in early 2013, he was treated everywhere like a pariah. Haldane’s Sieve blog, which supposedly posts all interesting pop-gen preprints from arXiv, completely ignored his paper. Yet, as it turned out over the last 18 months, Elhaik’s paper became the second most-read among all papers published in Genome Biology and Evolution, even surpassing “On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE”. Links to our previous blog posts covering Elhaik’s research are given below.

New Study Sheds Light on the Origin of the European Jewish Population

Few Useful Links on Khazar History for Population Biologists

Population Genetics Study on Khazar Origin of Jews is Back in Spotlight

In his blog, Elhaik has written a critique of a new Jewish Genome paper, which we are cross-posting below.


Cross-post: The prestige of “Jewish Genome” papers

Every bad scientific theory consists of three parts or acts. The first part is called “The Pledge”. The researcher shows you something ordinary: a DNA from a kid who danced with semi naked dancers in his Bar mitzvah and thereby is a Jew, a DNA of men who wear tight swimsuits and have one of the most common surnames in Israel (Cohen), or just man. He shows you this item. Perhaps he asks you to inspect its raw data to see if it is indeed real, unaltered, normal. But of course…it probably is. The second act is called “The Turn”. The researcher takes the ordinary something and makes it do something extraordinary, like proving it’s the descendant of King David or Aaron the Cohen. Now you’re looking for the secret… but you won’t find it, because of course you’re not really looking. You don’t really want to know. You want to be fooled. Even if you want to challenge this, you should know that it would never get published. But you wouldn’t clap yet. Because proving something that is impossible doesn’t qualify as science even if it comes with a p-value; you have to consider the criticism. That’s why every bad scientific theory has a third act, the hardest part, the part we call “The Prestige”. (with apologies to Christopher Nolan, writer of “The Prestige”)

When I read the new Carmi et al. (Carmi et al. 2014) paper I realized it is the prestige, an end of an era where the genomes of ordinary Jews are shown to be extraordinary. The first Jewish genome papers were nothing but inordinary. Nurit Kirsh described these “Jewish Genome” type of papers in her excellent review (Kirsh 2003). She noted that those written by Israelis and Jews attempted to justify the Zionist myth to provide a “confirmation” to the invented national Jewish identity by “proving” a common historical origin:

In half of the articles by the Israeli researchers, sources such as a history book, the Encyclopaedia Judaica, and even the Bible appear in the bibliography.

Israeli geneticists often expressed the idea that “all Jews are brothers.” Gurevitch and his coauthors compared the Ashkenazis to their Sepharadi and Oriental “brethren,”and the same term appears in their article on Moroccan and Tunisian Jews.

At the beginning of one of their articles Sheba and his colleagues quote from Genesis 43:9: “I will be surety for him; of my hand shalt thou require him.”50 In this verse Judah makes a commitment to his father, Jacob, that he will guarantee the well-being of his young brother Benjamin. Neither here nor anywhere else in the biblical chapter is there any hint of G6PD deficiency, but the quotation is a reminder that even when the researchers are talking about a population of hundreds of thousands, the reference is to a large family-a family originating from a father, his four wives, and their twelve sons. (Kirsh 2003)

Those were the days where scientific texts were as elevating as prayers. That hype did not end in the ’60s where Kirsh’s review has ended. That trend continues well into the 21st century melting together elements from science, politics, religion, and comedy. Harry Ostrer, one of the authors of (Carmi et al. 2014), offers in Legacy (Ostrer 2012) his racial theories, a new solution to the troubled Middle East, and even a hope for the return of Christ:

The evidence for biological Jewishness has become incontrovertible. (P 217)

The stakes in genetic analysis are high. It is more than an issue of who belongs in the family and can partake in Jewish life and Israeli citizenship. It touches on the heart of Zionist claims for a Jewish homeland in Israel. One can imagine future disputes about exactly how large the shared Middle Eastern ancestry of Jewish groups has to be to justify Zionist claims. (P 220)

And glorious lineages with genetic lines of descent from a king – even a Messiah – may become even more prized than the purported Cohanim modal haplotype was prized over the last decade. (P 220) (Ostrer 2012)

Don’t bother calling the people in white suits yet. More to come. The year of 2010 was no doubt the high year for “Jewish Genome” papers with a marvelous harvest of three papers in three prestigious journals. Indeed, some of us were concerned that science would ever be able to cleanse itself from these stains (Atzmon et al. 2010; Behar et al. 2010; Bray et al. 2010). It barely did, but before discussing that let us review some of the points of these papers:

Behar et al. (Behar et al. 2010) offered to distinguish dark-skinned Jews from pale-skinned ones:

In contrast, Ethiopian Jews (Beta Israel) and Indian Jews (Bene Israel and Cochini) cluster with neighboring autochthonous populations in Ethiopia and western India, respectively, despite a clear paternal link between the Bene Israel and the Levant.

alleging they are descended of converts, thus practically excluding them from the Jewish Genome continuum. Atzmon et al. (Atzmon et al. 2010) evoked god almighty to explain their data because nothing else would:

Rapid decay of IBD in Ashkenazi Jewish genomes was consistent with a severe bottleneck followed by large expansion, such as occurred with the so-called demographic miracle of population expansion from 50,000 people at the beginning of the 15th century to 5,000,000 people at the beginning of the 19th century.

And finally, (Bray et al. 2010) managed to prove that the secret and most guarded super power of Jews was their ability to consume alcohol without getting drunk, which is particularly funny given the high rates of alcoholism in Israel (see the previous link):

Our results, together with a recent study showing that variation in the ALDH2 promoter affects alcohol absorption in Jews (48), now suggest that genetic factors and selective pressure at the ALDH2 locus may have contributed to the low levels of alcoholism.

Ah those were the days that “science” could bleach any nonsense and made our day. Those were the days that we felt we could march to Iraq, which is within the borders of the Promised Land, and demand our land back from ISIS. These are the days that we could all drink and joy in our superior intellect (see Ostrer’s book and that of Wade’s).

Then came the criticism in the form of my paper (Elhaik 2013) that pulled the carpet from under this deck of cards. The paper received worldwide press coverage and is now ranked the #2 most-read paper in the journal Genome Biology and Evolution, almost two years after its publication, with its highlight ranked #3 (it used to be #1). (By the way, our ENCODE criticism (Graur et al. 2013) is ranked #5.) That paper showed that Jews are highly mixed with Middle Eastern, Caucasus, and Eastern/Western European elements and attributed their large presence in Eastern Europe to the contribution of Khazar converts who joined the faith around the 8th century. The paper debunked Ostrer’s divine intervention in the history of Jews as follows:

A major difficulty with the Rhineland Hypothesis, in addition to the lack of historical and anthropological evidence to the multi-migration waves from Palestine to Europe (Van Straten 2003; Sand 2009), is to explain the vast population expansion of Eastern European Jews from 50 thousand (15th century) to 8 million (20th century). The annual growth rate that accounts for this population expansion was estimated at 1.7-2%, one order of magnitude larger than that of Eastern European non-Jews in the 15th-17th centuries, prior to the industrial revolution (Van Straten 2007). This growth could not possibly be the product of natural population expansion, particularly one subjected to severe economic restrictions, slavery, assimilation, the Black Death and other plagues, forced and voluntary conversions, persecutions, kidnappings, rapes, exiles, wars, massacres, and pogroms (Koestler 1976; Van Straten 2003; Sand 2009). Because such an unnatural growth rate, over half a millennia and affecting only Jews residing in Eastern Europe, is implausible – it is explained by a miracle (Atzmon et al. 2010; Ostrer 2012). Unfortunately, this divine intervention explanation poses a new kind of problem – it is not science. The question how the Rhineland Hypothesis, so deeply rooted in supernatural reasoning, became the dominant scientific narrative is debated among scholars (e.g., Sand 2009).” (Elhaik 2013)

And so we return to the paper by Carmi and colleagues (Carmi et al. 2014). Now stripped of miracles or magic tricks, it is nothing but dull, plodding, pedantic, and most of all boring. Carmi and colleagues manage to argue that:

“Ashkenazi Jews (AJs) are a genetic isolate close to European and Middle Eastern groups” and that “AJ [are] an even admixture of European and likely Middle Eastern origins” without realizing these are two different things. The confusion continues when they state that “(AJ), identified as Jewish individuals of Central- and Eastern European ancestry.” Of course, none of these so-called ancestries are defined in any meaningful manner that allows us to understand its geographical origin.

However, this is the unfortunate end of the single Middle Eastern origin, one of the core elements of the “Jewish Genome” paradigm and the one the same authors preached for in their previous paper.

Imagine how we would lose face to ISIS after claiming our land back waving Ostrer’s book in front of them and they will show us Ostrer’s new paper entitling us to only half the land (by Ostrer’s method). How fortunate are we for not listening to Ostrer in the first place, but then again, why should we believe him now? And what would the Palestinians say? They are probably much closer to 100% Middle Eastern than the hafling mixed-blooded Jews? Is this the end of the Jewish state? What does it mean about Jewish continuity? If are we now half Jews, who has our other half? So many [stupid] questions…

It gets worse. In their new realization, the authors precluded the demographic miracle (now called “Explosive”! how exciting) and proposed a bottleneck some 600-800 years ago, around the time the Judaized Khazars arrived in Europe. The authors don’t really use the K-word nor cite my paper, which would be the proper thing to do, but then again their paper is embarrassing enough as it is, so we will let it go this time. How did the authors explain the massive demographic presence of Jews in Eastern Europe without invoking miracles or the Khazars? Carmi et al. now propose to explain it with multiple expulsions! In other words, Jews were expelled from western/central European countries and ended up in Eastern Europe! How come we didn’t think about it before? Of course, many people did, and had Carmi et al. actually educated themselves about the history of the population they study, they would have seen that the numbers don’t add up. Even in their new imaginative scenario where all expelled Jews would have travelled to Eastern Europe, they are still a few million Jews short (Straten 2011).

No need to expand on the meaning of “asymmetric gene flow from Europeans into the ancestral population of AJ” estimated now at 46-50% and its devastating consequences to the purity of Jews, but let us just mention some other no less imaginative statements made by the same authors such as:

“The rate of admixture is estimated to be 0.5% per generation over the 80 generations since the founding of the Ashkenazi Jews, indicating that this group might have remained endogamous throughout much of its history and that the offspring of those who married outside were lost as members of the community10.” (Ostrer 2001)

We also note that:

  • Carmi et al. have absolutely no way to conclude that Jews originated as a Middle Eastern population that mixed with Europeans, rather than the opposite direction.
  • Carmi et al. never tested alternative scenarios of admixture; they simply chose a two-population model (perhaps due to memory limitations of the computers at Columbia University), although it doesn’t take much to consider a more mixed model. The authors simply were not interested.

If we had any hope that the authors would grow old and wise, their figure S1 precludes such possibility. This is a somewhat typical PCA plot supposedly used to confirm Jewishness, but there is nothing typical about it. First, the authors do not say what proportion each principal component explains, it must be a really big secret. Second, the authors only used 47,713 SNPs after supposedly removing SNPs in high LD. This is very interesting because both Behar et al. (Behar et al. 2010) and myself (Elhaik 2013) applied the same approach to only half a million SNPs and found over 220,000 markers. Carmi et al. had the whole genome and yet they obtained less than a quarter of the markers we analyzed, which raises concerns that these markers were either pre selected or that the computers in Columbia University have not been upgraded for the past 20 years and cannot analyze a large number of markers. Another point of concern is the selected few populations in the plot out of >50 HGDP populations. Of course none of these questions are relevant as the sole purpose of this plot is to show that Jews are a “population isolate” floating in a sea of genetic uniqueness that is not shared by any other population. Too bad that once again, this is not science.

The Carmi et al. paper (Carmi et al. 2014) marks the end of an era that most likely culminated with my paper (Elhaik 2013), although it was not cited, much as the ENCODE era ended with our criticism, which also went uncited. Fortunately, we still have the papers from the good old days to cheer us up!


Atzmon G, et al. 2010. Abraham’s children in the genome era: major Jewish diaspora populations comprise distinct genetic clusters with shared Middle Eastern Ancestry. Am. J. Hum. Genet. 86:850-859.

Behar DM, et al. 2010. The genome-wide structure of the Jewish people. Nature. 466:238-242.

Bray SM, Mulle JG, Dodd AF, Pulver AE, Wooding S, Warren ST. 2010. Signatures of founder effects, admixture, and selection in the Ashkenazi Jewish population. Proc. Natl. Acad. Sci. USA. 107:16222-16227.

Carmi S, et al. 2014. Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins. Nat Commun. 5:4835.

Elhaik E. 2013. The Missing Link of Jewish European Ancestry: Contrasting the Rhineland and the Khazarian Hypotheses. Genome Biology and Evolution. 5:61-74.

Graur D, Zheng Y, Price N, Azevedo RB, Zufall RA, Elhaik E. 2013. On the Immortality of Television Sets: “Function” in the Human Genome According to the Evolution-Free Gospel of ENCODE. Genome Biol Evol. 5:578-590.

Kirsh N. 2003. Population genetics in Israel in the 1950s. The unconscious internalization of ideology. Isis. 94:631-655.

Koestler A. 1976. The thirteenth tribe : the Khazar empire and its heritage. New York: Random House.

Ostrer H. 2012. Legacy: A Genetic History of the Jewish People. OUP USA.

Ostrer H. 2001. A genetic profile of contemporary Jewish populations. Nat. Rev. Genet. 2:891-898.

Sand S. 2009. The invention of the Jewish people. London: Verso.

Straten JV. 2011. The origin of Ashkenazi Jewry : the controversy unraveled. New York: Walter de Gruyter.

Van Straten J. 2003. Jewish Migrations from Germany to Poland: the Rhineland Hypothesis Revisited. Mankind Quarterly. 44:367-384.

Van Straten J. 2007. Early Modern Polish Jewry The Rhineland Hypothesis Revisited. Hist. Methods. 40:39-50.

T. Ryan Gregory Starts ‘Evolution Consulting Service’ after Reading Gibbon Paper in Nature

T. Ryan Gregory, who has been fighting ENCODE’s junk science since 2007, is fed up with another high-profile paper published in Nature. He wrote on Twitter –



The paper in question has over seventy authors, which tends to scare us these days. More authors mean a higher chance of ‘faddish stuff’ or ‘fashionable ideas’ getting promoted to the abstract, and a higher risk of being humiliated by Dan Graur sooner or later.

Gibbon genome and the fast karyotype evolution of small apes

Gibbons are small arboreal apes that display an accelerated rate of evolutionary chromosomal rearrangement and occupy a key node in the primate phylogeny between Old World monkeys and great apes. Here we present the assembly and analysis of a northern white-cheeked gibbon (Nomascus leucogenys) genome. We describe the propensity for a gibbon-specific retrotransposon (LAVA) to insert into chromosome segregation genes and alter transcription by providing a premature termination site, suggesting a possible molecular mechanism for the genome plasticity of the gibbon lineage. We further show that the gibbon genera (Nomascus, Hylobates, Hoolock and Symphalangus) experienced a near-instantaneous radiation ~5 million years ago, coincident with major geographical changes in southeast Asia that caused cycles of habitat compression and expansion. Finally, we identify signatures of positive selection in genes important for forelimb development (TBX5) and connective tissues (COL1A1) that may have been involved in the adaptation of gibbons to their arboreal habitat.

We asked Ryan by email what he saw wrong in the paper, and he was kind enough to explain. Ken Weiss and co-authors often touch on similar flaws in evolutionary thinking in their interesting blog.

It’s just very basic “tree thinking”, or correctly interpreting a phylogeny. In the case of the gibbon paper, they start off with the following statement:

“In the primate phylogeny, gibbons diverged between Old World monkeys and great apes, providing a unique perspective from which to study the origins of hominoid characteristics.”

This is nonsensical. If you look at a phylogeny of primates, the Old World monkeys are the outgroup to the clade that includes gibbons and great apes. This means: a) any member of that clade is equally closely related to Old World monkeys — gibbons are not closer than humans are to Old World monkeys, b) both the great apes and gibbons are descended from a common ancestor that is not shared by Old World monkeys — it is therefore equally (in)accurate to say that great apes diverged between Old world monkeys and gibbons. Also, these are all modern species. The fact that gibbons are the outgroup to great apes absolutely does not imply that they are similar to the ancestor of the entire clade. Early branching does not equal primitive.

If you are not willing to pay the measly sum Ryan is charging, he offers a free DIY solution – read his 2008 paper.

Understanding Evolutionary Trees

Charles Darwin sketched his first evolutionary tree in 1837, and trees have remained a central metaphor in evolutionary biology up to the present. Today, phylogenetics—the science of constructing and evaluating hypotheses about historical patterns of descent in the form of evolutionary trees—has become pervasive within and increasingly outside evolutionary biology. Fostering skills in “tree thinking” is therefore a critical component of biological education. Conversely, misconceptions about evolutionary trees can be very detrimental to one’s understanding of the patterns and processes that have occurred in the history of life. This paper provides a basic introduction to evolutionary trees, including some guidelines for how and how not to read them. Ten of the most common misconceptions about evolutionary trees and their implications for understanding evolution are addressed.

E. coli Nanopore Data Release and #baltiandbioinformatics Nanopore Conference

Nick Loman uploaded his E. coli nanopore data to GigaDB.

Here we present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION device with R7 chemistry during the early-access MinION Access Program (MAP).

Three sequencing runs of the MinION were performed using R7 chemistry. The first run produced 43,656 forward reads (272Mb), 23,338 (125Mb) reverse reads, 20,087 two-direction (2D) reads (131Mb), of which 8% (10Mb) were full 2D. Full 2D reads means that the complementary strand was successfully slowed through the pore.

The R7 protocol has a modification to increase the relative number of full 2D reads (NONI). To exploit this, two new libraries were produced which included an overnight incubation stage (ONI-1 and ONI-2). Each library was run on an individual flow cell. This resulted in 6,534 & 8,260 forward reads, 2,171 & 2,945 reads and 1,740 & 2,394 2D reads (27%, 29%). In this case, 50% and 41.8% of reads were full 2D respectively. The mean fragment lengths for 2D reads from the three libraries were 6,543 (NONI) 6,907 (ONI-1) and 6,434 (ONI-2).

We took the reads, created a kmer distribution and then merged that distribution with the kmers from the E. coli reference genome (MG1655-K12.fasta). The kmer distribution file for the reads can be downloaded from here and the genome map can be downloaded from here. A snapshot is shown below:

(column 1: 12mer, column 2: genomic location, column 3: redundancy in genome, column 4: kmer count in Nick Loman’s data)


Those two files should tell you a lot more about the technology and its potential than many talks, papers and blog posts. So, readers are encouraged to take a look.
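For readers who want to rebuild a table like this themselves, here is a minimal sketch of the merge in Python. It is not the script we used. The function names are made up for illustration, and it counts kmers on the forward strand only (no reverse complements), which is an assumption rather than a statement about how the files above were made.

```python
from collections import Counter, defaultdict

K = 12  # kmer length used in this post


def iter_kmers(seq, k=K):
    """Yield (position, kmer) for every kmer of seq made only of A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield i, kmer


def merge_read_kmers_with_genome(genome, reads, k=K):
    """Return one row per genomic position:
    (kmer, genomic location, redundancy in genome, kmer count in reads).

    Assumption: forward strand only; reverse complements are not merged.
    """
    # Column 4: how often each kmer occurs in the reads.
    read_counts = Counter(kmer for read in reads for _, kmer in iter_kmers(read, k))

    # Column 3: how many genomic positions share the same kmer.
    positions = defaultdict(list)
    for pos, kmer in iter_kmers(genome, k):
        positions[kmer].append(pos)

    return [
        (kmer, pos, len(positions[kmer]), read_counts.get(kmer, 0))
        for pos, kmer in iter_kmers(genome, k)
    ]


if __name__ == "__main__":
    # Toy input with k=4 so the table is small enough to read.
    for row in merge_read_kmers_with_genome("ACGTACGT", ["ACGTAC", "TTTT"], k=4):
        print(*row, sep="\t")
```

For a real genome you would read the sequences from FASTA/FASTQ files instead of passing strings, but the four columns come out the same way.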

A few observations:

1. Good match with genome in most locations

Based on kmer distribution of reads, we checked for kmer coverage of genome and found kmers present from almost all locations of the genome.

2. Poly-A regions are missed.

If point 1 gives you any confidence in the data (or our analysis), you are fooled. Given the number of reads and the short length of 12mers, even completely noisy data would be expected to contain almost all kmers from the genome. However, a complete lack of kmers from some parts of the genome is informative and can definitely be ascribed to base-calling errors. We found a few such missing regions, and they almost always contained poly-A stretches.
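The claim that pure noise would also cover nearly every 12mer can be checked with a rough back-of-the-envelope calculation. The sketch below assumes kmers are drawn uniformly at random (real sequencing noise is not uniform, which is exactly why the missing regions are interesting) and uses the 272Mb of forward reads from the first run as an approximate number of sampled kmers. The function name is ours.

```python
import math

K = 12
N_POSSIBLE = 4 ** K  # number of distinct 12mers: 16,777,216


def expected_fraction_seen(n_sampled, n_possible=N_POSSIBLE):
    """Expected fraction of distinct kmers seen at least once when n_sampled
    kmers are drawn uniformly at random (Poisson approximation).

    P(a given kmer is never drawn) is approximately exp(-n_sampled / n_possible).
    """
    return -math.expm1(-n_sampled / n_possible)


if __name__ == "__main__":
    n_sampled = 272_000_000  # assumption: about one kmer per base of forward reads
    print("expected hits per 12mer:", n_sampled / N_POSSIBLE)
    print("expected fraction of 12mers seen:", expected_fraction_seen(n_sampled))
```

With roughly 16 expected hits per possible 12mer, the chance that a given 12mer is never seen by accident is well under one in a million under this model. So full coverage says little, while a stretch of the genome with zero matching kmers stands out.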


Here is the opposite extreme – genomic regions with too many unexpected kmers in the reads.


3. Kmer distribution – read vs ref

If almost every kmer is expected to appear among the reads, how can we tell whether the reads contain any information? The following figure could be of help. The dotted line shows the occurrence frequency of all kmers from the reads, and the solid line shows the same for only those kmers that actually appear in the genome. The solid peak is shifted to the right, which means that kmers from the genome appear more frequently in the reads than random kmers do.
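The two curves can be computed from the kmer counts alone. Here is a small sketch, with hypothetical function names and toy counts rather than the real data: it builds one histogram over all kmers seen in the reads and one restricted to kmers that occur in the genome, then compares their means as a crude stand-in for "the peak is shifted to the right".

```python
from collections import Counter


def occurrence_histograms(read_counts, genome_kmers):
    """Return (all_hist, genome_hist), each mapping
    'number of occurrences in reads' -> 'number of distinct kmers'.

    all_hist covers every kmer seen in the reads (the dotted line);
    genome_hist covers only kmers also present in the genome (the solid line).
    """
    all_hist = Counter(read_counts.values())
    genome_hist = Counter(
        count for kmer, count in read_counts.items() if kmer in genome_kmers
    )
    return all_hist, genome_hist


def mean_occurrence(hist):
    """Mean number of occurrences per distinct kmer in a histogram."""
    n_kmers = sum(hist.values())
    if n_kmers == 0:
        return 0.0
    return sum(count * n for count, n in hist.items()) / n_kmers


if __name__ == "__main__":
    # Toy counts for illustration only; these are not from the E. coli data.
    read_counts = Counter({"AAAA": 1, "ACGT": 5, "CCCC": 2, "GGGG": 5})
    genome_kmers = {"ACGT", "GGGG"}
    all_hist, genome_hist = occurrence_histograms(read_counts, genome_kmers)
    print("all kmers:", dict(all_hist), "mean", mean_occurrence(all_hist))
    print("genome kmers:", dict(genome_hist), "mean", mean_occurrence(genome_hist))
```

If the reads carry signal, the genome-restricted histogram should sit at higher counts than the histogram over all kmers.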


4. Unusual kmers in the reads, not from the genome

The reads also have some kmers appearing at high frequency, which are not at all present in the genome.


Analyses 1-4 should help in choosing an appropriate aligner rather than shooting blindly. We will add more as we analyze the data in more detail.


On #baltiandbioinformatics Nanopore Conference

1. Very impressive. This is the future !!

‘This’ refers to the Google Hangout way of running conferences, which anyone can attend from his/her bedroom, and Bill Clinton with his cigar :)


Scott Edmunds’s GigaDB, which we covered earlier, turns out to be another tech winner in the publishing world.

2. Nanopore Technology

Two highlights –

a) SPAdes assembled E. coli using nanopore+ILMN and got good results. Note – they needed ILMN and so did Schatz.

b) Error rate figure –


Clive Brown’s talk is very impressive and he is beaming with infectious optimism. Hopefully Loman’s tests will show how fast that infection spreads :).

His vision is right, but bioinformatics remains the weak point in this project as explained next.

3. Cost of sequencing

If each nanopore stick costs $1000, the cost of sequencing is high compared to PacBio (and Illumina). The mobility nanopore provides is truly revolutionary, but for his vision to work, the price point for the sticks has to be way lower.

We have no idea what the manufacturing costs for those sticks are, but let us assume that he can provide them for $10 a pop. In that case, to maintain a cash flow of $100M or higher (to justify the valuation), data management and analysis have to be the strong point of the company. So, essentially, ONT becomes a bioinformatics company.



An interesting viewpoint from the CEO of a company that wants to pass itself off as a ‘small startup’ from a garage.