In DALIGN Paper, Gene Myers Delivers a Major Blow to His Biggest Competitor


While Gene Myers was busy working on neural images of fruit flies, he noticed the bioinformatics landscape being taken over by a search algorithm originating from a ‘competitor’. Now he comes back with an outstanding paper (DALIGN) to show why he is still the master.

That competitor of Gene Myers is none other than Gene Myers, 20 years younger. In the early 90s, he and Udi Manber (currently leading the search group at Google) introduced the concept of suffix arrays. Fast forward two decades, and every nucleotide search program for new sequencing technologies uses suffix arrays, combined with the Burrows–Wheeler transform (BWT) and FM-indices. We covered that history in an earlier commentary (check this blog post from 2011).

Myers tried to use BLASR, which incorporates suffix arrays, for aligning PacBio reads and found it to be too slow. The trouble comes from memory bandwidth and poor cache coherence. Modern processors have three levels of cache (L1, L2, L3), and the speed difference between having all data in L1 cache vs. DRAM can be 100x. The BWT-based FM-index algorithm makes effectively random accesses to memory locations, so the search process becomes highly dependent on DRAM access. Especially when one tries to scale BWT-based alignment programs across multiple cores, memory bandwidth becomes the major bottleneck. Our readers are well familiar with that drawback (check “The Future of Computers – Multicore and the Memory Wall”).
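The memory-wall effect is easy to demonstrate for yourself. The toy benchmark below (our own illustration, nothing to do with BLASR's actual code) sums the same array twice: once walking it sequentially, once in a shuffled order. The arithmetic is identical; only the memory-access pattern changes, and the random order defeats the prefetcher and cache (the gap is far larger in C, where interpreter overhead does not mask it):

```python
import random
import time

N = 2_000_000
data = list(range(N))

seq_order = list(range(N))
rand_order = seq_order[:]
random.Random(42).shuffle(rand_order)

def sum_in_order(order):
    # Same additions either way; only the access pattern differs.
    total = 0
    for i in order:
        total += data[i]
    return total

t0 = time.perf_counter()
s_seq = sum_in_order(seq_order)
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
s_rand = sum_in_order(rand_order)
t_rand = time.perf_counter() - t0

assert s_seq == s_rand  # identical result, different memory behavior
print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")
```

An FM-index search behaves like the shuffled case: each backward-search step jumps to an essentially unpredictable position in the index.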

The algorithm proposed by Myers overcomes that bottleneck in two steps. They are so elegant that even the memory-heavy part of his algorithm sees a 3.6x speedup on a 4-core machine, whereas the less memory-intensive part sees 3.88x. The steps are – (i) finding read pairs highly likely to align, based on matching seeds, and (ii) alignment. Even though they sound innocuous, the beauty is in the details. The memory-intensive rapid seed detection step introduces a highly cache-coherent radix sort that attempts to keep the relevant bits in L1 cache. The rapid local alignment step uses Myers’ own O(nd) algorithm from the mid-80s (see “How Does the Unix Diff Algorithm Work?”), but with several optimizations to limit the rapidly expanding parallel waves at every step of the search. In particular, it exploits the random distribution of errors in PacBio reads.
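The core of the O(nd) idea is to keep one “furthest-reaching” point per diagonal and slide along matches for free, so the wave of live diagonals grows only with the number of differences d. A bare-bones sketch of the distance computation from the 1986 diff paper (insertions and deletions only; DALIGNER's actual wave-pruning heuristics are not shown here):

```python
def diff_distance(a, b):
    """Myers' O((n+m)*d) greedy algorithm: shortest edit script length
    (insertions + deletions) between sequences a and b."""
    n, m = len(a), len(b)
    max_d = n + m
    off = max_d + 1                      # offset so diagonal k indexes >= 0
    v = [0] * (2 * max_d + 3)            # v[off+k] = furthest x on diagonal k
    for d in range(max_d + 1):           # the "wave": d differences so far
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[off + k - 1] < v[off + k + 1]):
                x = v[off + k + 1]       # step down  (insert into a)
            else:
                x = v[off + k - 1] + 1   # step right (delete from a)
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x += 1                   # snake: follow matches for free
                y += 1
            v[off + k] = x
            if x >= n and y >= m:
                return d
    return max_d

# The classic example from Myers' paper: 5 edits.
print(diff_distance("ABCABBA", "CBABAC"))
```

Because PacBio errors are randomly distributed rather than clustered, most diagonals fall behind the leaders quickly and can be pruned, which is exactly the property the DALIGN paper exploits.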

What next after this formidable accomplishment of speeding up PacBio alignment 25X over BLASR and thus making large assemblies possible? There are rumors that Gene Myers plans to demonstrate one or more of the following three – (i) walk over water on river Elbe, (ii) design a perfect assembler, (iii) finish the zebrafish genome. Given how difficult (ii) and (iii) turned out to be so far, we believe he will go for (i).

If, instead, Myers plans to finish Danio and thousands of other recently published incomplete genomes, he needs to hurry up. We heard that Udi Manber, who is currently at Google, is working on a different solution for those bad genomes. Google plans to replace all “NNNNNN” regions with targeted ads, and there is a possibility that Google will buy a short-read company, because the more gaps, the better.


The full article by Myers will appear in WABI 2014 (Workshop on Algorithms in Bioinformatics), Sept 8-10, Wroclaw, Poland. In the meantime, readers interested in knowing more about his developments can check his blog. The code of DALIGNER is available from GitHub.

Antarctic Sea Ice Reaches Highest Levels Since Records Began

ABC News reports

Scientists say the extent of Antarctic sea ice cover is at its highest level since records began.

Satellite imagery reveals an area of about 20 million square kilometres covered by sea ice around the Antarctic continent.

Jan Lieser from the Antarctic Climate and Ecosystems Cooperative Research Centre (CRC) said the discovery was made two days ago.

“This is an area covered by sea ice which we’ve never seen from space before,” he said.

“Thirty-five years ago the first satellites went up which were reliably telling us what area, two dimensional area, of sea ice was covered and we’ve never seen that before, that much area.

Zerohedge adds more color and a satellite image –

In what appears to be an awkward moment of uncomfortable fact, ABC reports satellite imagery reveals an area of about 20 million square kilometres covered by sea ice around the Antarctic continent – the highest level of coverage since records began. This is the 3rd year in a row that the sea ice coverage has reached a record level – increasing at 1.5% each decade since 1979. However, there is another side to this, as the area covered in sea ice expands scientists have said the ice on the continent of Antarctica which is not over the ocean continues to deplete. The climate is changing, one way or the other.


Not only in Antarctica, but globally, temperatures have not risen from where they were 15 years ago, in what the Acting-man blog calls The “Inconvenient Pause”.

Given that several years ago, climate scientists admitted that their models would have to be thrown out if the “pause” lasted longer than 15 years, it may be a bit surprising to find that not only have they not been thrown out, but the scaremongering in the media generally continues to be as shrill as ever, if not more so.

In the meantime, about 40 different explanations for the pause have been proposed in scientific papers. Obviously, not all 40 can be correct, but a thread that is unifying most of them has emerged: the pause is the result of, well, nature. Taking a layman’s stab at this, we would propose that the same is true of the most recent warming period and the 30 years of cooling that preceded it. In other words, human influence on the climate is probably so small as to be negligible, and there is no cause for alarm.

So why the continued alarmism? Simple: it sells. Newspapers for instance aren’t sold with stories about there being no reason to be alarmed about something or other. Alarmist scientists face some risk to their reputation, but more importantly, a huge gravy train could be derailed if they admit that they have been wrong all along. As we have previously pointed out, before the current climate alarmism really got going in the 1980s, climate science was a rather small, sparsely funded scientific niche. Along with all the other government-subsidized activities that have sprung from climate alarmism (think e.g. “green, but economically non-viable energy”) it has in the meantime grown into a quite sizable tax-payer funded racket.


Bob Hoye – Tyrannical Duncery (continued)

Global Warming Nonsense

With a degree in Geophysics this writer has found the story about man-caused global warming fascinating. Fascinating in its blatant abuse of the discipline and skepticism that has driven the advance of science for thousands of years. Even more fascinating is that since catastrophic warming displaced global cooling concerns of the 1970s it has become another great experiment in authoritarian science.
The last such example involved a bitter struggle between the establishment and the giants in science Copernicus, Galileo and Kepler. The establishment spent considerable effort to incorporate growing astronomical evidence into their Earth-centric model. They imagined crystal spheres within spheres with absolutely impossible planetary motions. Officialdom promoted absurdity. Research contrary to authoritarian dogma was not permitted.
A prosperous middle class devoured books and pamphlets that criticized the establishment’s dogma.

Something similar prevails today and today’s “deniers” should be relieved that the full censure of the movement does not include the threat of capital punishment. In which case, the continued assembly of climate data with reason and logic independent of state propaganda will win the argument.

Can the issues be described as state propaganda?

The remedy for every intense concern about global warming and climate change is an increase in taxation and regulation. By the state, of the state and for the state.
In January 2008 this writer published “Intellectual Hysteria” which reviewed the promoted ‘political science’ as well as actual science. The main points were that the CAGW movement promoted that only man-generated CO2 was the cause of global warming. It was not the first time that mankind was deemed evil. Likely it won’t be the last.
The science part was a review of climate changes based upon orbital mechanics of the solar system. Earth scientists not on the government band wagon appreciate how climate history really works.
The title “Intellectual Hysteria” was based upon the long history of charismatic intellectuals discovering a personal revelation and promoting it. Usually part of the personal concern included that there were too many people on the planet. In the politically turbulent 1860s, Stanley Jevons calculated that the supplies of coal would soon be exhausted. His book was called The Coal Question. In the politically turbulent 1790s, Thomas Malthus promoted his personal anxieties under the rubric later described as “Malthusian Catastrophe”.

Cross-post: The prestige of “Jewish Genome” papers

When, in early 2013, we covered Eran Elhaik’s paper on the origin of European Jewish population, he was treated everywhere like a pariah. Haldane’s Sieve blog, which supposedly posts all interesting pop-gen preprints from arxiv, completely ignored his paper. Yet, as it turned out over the last 18 months, Elhaik’s paper became the second most well-read among all papers published in Genome Biology and Evolution, even surpassing “On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE”. Links from our previous blog posts covering Elhaik’s research are given below.

New Study Sheds Light on the Origin of the European Jewish Population

Few Useful Links on Khazar History for Population Biologists

Population Genetics Study on Khazar Origin of Jews is Back in Spotlight

In his blog, Elhaik has written a critique of a new Jewish Genome paper, which we are cross-posting below.


Cross-post: The prestige of “Jewish Genome” papers

Every bad scientific theory consists of three parts or acts. The first part is called “The Pledge”. The researcher shows you something ordinary: a DNA from a kid who danced with semi-naked dancers in his Bar mitzvah and thereby is a Jew, a DNA of men who wear tight swimsuits and have one of the most common surnames in Israel (Cohen), or just a man. He shows you this item. Perhaps he asks you to inspect its raw data to see if it is indeed real, unaltered, normal. But of course…it probably is. The second act is called “The Turn”. The researcher takes the ordinary something and makes it do something extraordinary, like proving it’s the descendant of King David or Aaron the Cohen. Now you’re looking for the secret… but you won’t find it, because of course you’re not really looking. You don’t really want to know. You want to be fooled. Even if you want to challenge this, you should know that it would never get published. But you wouldn’t clap yet. Because proving something impossible doesn’t qualify as science, even if it comes with a p-value; you have to consider the criticism. That’s why every bad scientific theory has a third act, the hardest part, the part we call “The Prestige”. (with apologies to Christopher Nolan, writer of “The Prestige”)

When I read the new Carmi et al. (Carmi et al. 2014) paper, I realized it is the prestige, the end of an era in which the genomes of ordinary Jews were shown to be extraordinary. The first Jewish genome papers were anything but ordinary. Nurit Kirsh described these “Jewish Genome” papers in her excellent review (Kirsh 2003). She noted that those written by Israelis and Jews attempted to justify the Zionist myth and to provide a “confirmation” of the invented national Jewish identity by “proving” a common historical origin:

In half of the articles by the Israeli researchers, sources such as a history book, the Encyclopaedia Judaica, and even the Bible appear in the bibliography.

Israeli geneticists often expressed the idea that “all Jews are brothers.” Gurevitch and his coauthors compared the Ashkenazis to their Sepharadi and Oriental “brethren,” and the same term appears in their article on Moroccan and Tunisian Jews.

At the beginning of one of their articles Sheba and his colleagues quote from Genesis 43:9: “I will be surety for him; of my hand shalt thou require him.” In this verse Judah makes a commitment to his father, Jacob, that he will guarantee the well-being of his young brother Benjamin. Neither here nor anywhere else in the biblical chapter is there any hint of G6PD deficiency, but the quotation is a reminder that even when the researchers are talking about a population of hundreds of thousands, the reference is to a large family – a family originating from a father, his four wives, and their twelve sons. (Kirsh 2003)

Those were the days when scientific texts were as elevating as prayers. That hype did not end in the ’60s, where Kirsh’s review ends. The trend continues well into the 21st century, melding together elements of science, politics, religion, and comedy. Harry Ostrer, one of the authors of (Carmi et al. 2014), offers in Legacy (Ostrer 2012) his racial theories, a new solution to the troubled Middle East, and even a hope for the return of Christ:

The evidence for biological Jewishness has become incontrovertible. (P 217)

The stakes in genetic analysis are high. It is more than an issue of who belongs in the family and can partake in Jewish life and Israeli citizenship. It touches on the heart of Zionist claims for a Jewish homeland in Israel. One can imagine future disputes about exactly how large the shared Middle Eastern ancestry of Jewish groups has to be to justify Zionist claims. (P 220)

And glorious lineages with genetic lines of descent from a king – even a Messiah – may become even more prized than the purported Cohanim modal haplotype was prized over the last decade. (P 220) (Ostrer 2012)

Don’t bother calling the people in white suits yet. More to come. The year 2010 was no doubt the high point for “Jewish Genome” papers, with a marvelous harvest of three papers in three prestigious journals. Indeed, some of us were concerned that science would never be able to cleanse itself from these stains (Atzmon et al. 2010; Behar et al. 2010; Bray et al. 2010). It barely did, but before discussing that let us review some of the points of these papers:

Behar et al. (Behar et al. 2010) offered to distinguish dark-skinned Jews from pale-skinned ones:

In contrast, Ethiopian Jews (Beta Israel) and Indian Jews (Bene Israel and Cochini) cluster with neighboring autochthonous populations in Ethiopia and western India, respectively, despite a clear paternal link between the Bene Israel and the Levant.

alleging they are descended from converts, thus practically excluding them from the Jewish Genome continuum. Atzmon et al. (Atzmon et al. 2010) invoked God almighty to explain their data, because nothing else would:

Rapid decay of IBD in Ashkenazi Jewish genomes was consistent with a severe bottleneck followed by large expansion, such as occurred with the so-called demographic miracle of population expansion from 50,000 people at the beginning of the 15th century to 5,000,000 people at the beginning of the 19th century.

And finally, (Bray et al. 2010) managed to prove that the secret and most guarded superpower of Jews was their ability to consume alcohol without getting drunk, which is particularly funny given the high rates of alcoholism in Israel (see the previous link):

Our results, together with a recent study showing that variation in the ALDH2 promoter affects alcohol absorption in Jews (48), now suggest that genetic factors and selective pressure at the ALDH2 locus may have contributed to the low levels of alcoholism.

Ah, those were the days when “science” could bleach any nonsense and make our day. Those were the days when we felt we could march to Iraq, which is within the borders of the Promised Land, and demand our land back from ISIS. Those were the days when we could all drink and rejoice in our superior intellect (see Ostrer’s book and that of Wade).

Then came the criticism, in the form of my paper (Elhaik 2013), which pulled the carpet from under this house of cards. The paper received worldwide press coverage and is now the #2 most-read paper in the journal Genome Biology and Evolution, almost two years after its publication, with its highlight ranked #3 (it used to be #1). (By the way, our ENCODE criticism (Graur et al. 2013) is ranked #5.) That paper showed that Jews are highly mixed, with Middle Eastern, Caucasus, and Eastern/Western European elements, and attributed their large presence in Eastern Europe to the contribution of the Khazars – converts who joined the faith around the 8th century. The paper debunked Ostrer’s divine intervention in the history of Jews as follows:

A major difficulty with the Rhineland Hypothesis, in addition to the lack of historical and anthropological evidence to the multi-migration waves from Palestine to Europe (Van Straten 2003; Sand 2009), is to explain the vast population expansion of Eastern European Jews from 50 thousand (15th century) to 8 million (20th century). The annual growth rate that accounts for this population expansion was estimated at 1.7-2%, one order of magnitude larger than that of Eastern European non-Jews in the 15th-17th centuries, prior to the industrial revolution (Van Straten 2007). This growth could not possibly be the product of natural population expansion, particularly one subjected to severe economic restrictions, slavery, assimilation, the Black Death and other plagues, forced and voluntary conversions, persecutions, kidnappings, rapes, exiles, wars, massacres, and pogroms (Koestler 1976; Van Straten 2003; Sand 2009). Because such an unnatural growth rate, over half a millennia and affecting only Jews residing in Eastern Europe, is implausible – it is explained by a miracle (Atzmon et al. 2010; Ostrer 2012). Unfortunately, this divine intervention explanation poses a new kind of problem – it is not science. The question how the Rhineland Hypothesis, so deeply rooted in supernatural reasoning, became the dominant scientific narrative is debated among scholars (e.g., Sand 2009).” (Elhaik 2013)

And so we return to the paper by Carmi and colleagues (Carmi et al. 2014); now stripped of miracles or magic tricks, it is nothing but dull, plodding, pedantic, and most of all boring. Carmi and colleagues manage to argue that:

“Ashkenazi Jews (AJs) are a genetic isolate close to European and Middle Eastern groups” and that “AJ [are] an even admixture of European and likely Middle Eastern origins”, without realizing these are two different things. The confusion continues when they state that “(AJ), identified as Jewish individuals of Central- and Eastern European ancestry.” Of course, none of these so-called ancestries is defined in any meaningful manner that would allow us to understand its geographical origin.

However, this is the unfortunate end of the single Middle Eastern origin, one of the core elements of the “Jewish Genome” paradigm and the one the same authors preached in their previous paper.

Imagine how we would lose face to ISIS after claiming our land back, waving Ostrer’s book in front of them, when they show us Ostrer’s new paper entitling us to only half the land (by Ostrer’s method). How fortunate we are for not listening to Ostrer in the first place, but then again, why should we believe him now? And what would the Palestinians say? They are probably much closer to 100% Middle Eastern than the halfling mixed-blood Jews. Is this the end of the Jewish state? What does it mean for Jewish continuity? If we are now half Jews, who has our other half? So many [stupid] questions…

It gets worse. In their new realization, the authors precluded the demographic miracle (now called “explosive”! how exciting) and propose a bottleneck some 600-800 years ago, around the time the Judaized Khazars arrived in Europe. The authors don’t really use the K-word, nor cite my paper, which would have been the proper thing to do, but then again their paper is embarrassing enough as it is, so we will let it go this time. How did the authors explain the massive demographic presence of Jews in Eastern Europe without invoking miracles or Khazars? Carmi et al. now propose to explain it with multiple expulsions! In other words, Jews were expelled from western/central European countries and ended up in Eastern Europe! How come we didn’t think about it before? Of course, many people did, and had Carmi et al. actually educated themselves about the history of the population they study, they could have seen that the numbers don’t add up. Even in their new imaginative scenario, where all expelled Jews would have travelled to Eastern Europe, they are still a few million Jews short (Straten 2011).

No need to expand on the meaning of “asymmetric gene flow from Europeans into the ancestral population of AJ”, now estimated at 46-50%, and its devastating consequences for the purity of Jews, but let us just mention some other, no less imaginative, statements made by the same authors, such as:

“The rate of admixture is estimated to be 0.5% per generation over the 80 generations since the founding of the Ashkenazi Jews, indicating that this group might have remained endogamous throughout much of its history and that the offspring of those who married outside were lost as members of the community10.” (Ostrer 2001)

We also note that:

  • Carmi et al. have absolutely no way to conclude that Jews originated as a Middle Eastern population that mixed with Europeans, rather than the other way around.
  • Carmi et al. never tested alternative scenarios of admixture; they simply chose a two-population model (perhaps due to memory limitations of the computers at Columbia University), although it wouldn’t take much to consider a more mixed model. The authors simply were not interested.

If we had any hope that the authors would grow old and wise, their figure S1 precludes such a possibility. This is a somewhat typical PCA plot, supposedly used to confirm Jewishness, but there is nothing typical about it. First, the authors do not say what proportion of the variance each principal component explains; it must be a really big secret. Second, the authors used only 47,713 SNPs after supposedly removing SNPs in high LD. This is very interesting, because both Behar et al. (Behar et al. 2010) and myself (Elhaik 2013) applied the same approach to only half a million SNPs and found over 220,000 markers. Carmi et al. had the whole genome and yet obtained less than a quarter of the markers we analyzed, which raises concerns that these markers were either pre-selected or that the computers at Columbia University have not been upgraded for the past 20 years and cannot analyze a large number of markers. Another point of concern is the few selected populations in the plot, out of >50 HGDP populations. Of course, none of these questions is relevant, as the sole purpose of this plot is to show that Jews are a “population isolate” floating in a sea of genetic uniqueness that is not shared by any other population. Too bad that, once again, this is not science.

The Carmi et al. (Carmi et al. 2014) paper marks the end of an era that most likely culminated with my paper, although it went uncited (Elhaik 2013), much as the ENCODE era ended with our criticism, which also went uncited. Fortunately, we still have the papers from the good old days to cheer us up!


Atzmon G, et al. 2010. Abraham’s children in the genome era: major Jewish diaspora populations comprise distinct genetic clusters with shared Middle Eastern Ancestry. Am. J. Hum. Genet. 86:850-859.

Behar DM, et al. 2010. The genome-wide structure of the Jewish people. Nature. 466:238-242.

Bray SM, Mulle JG, Dodd AF, Pulver AE, Wooding S, Warren ST. 2010. Signatures of founder effects, admixture, and selection in the Ashkenazi Jewish population. Proc. Natl. Acad. Sci. USA. 107:16222-16227.

Carmi S, et al. 2014. Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins. Nat Commun. 5:4835.

Elhaik E. 2013. The Missing Link of Jewish European Ancestry: Contrasting the Rhineland and the Khazarian Hypotheses. Genome Biology and Evolution. 5:61-74.

Graur D, Zheng Y, Price N, Azevedo RB, Zufall RA, Elhaik E. 2013. On the Immortality of Television Sets: “Function” in the Human Genome According to the Evolution-Free Gospel of ENCODE. Genome Biol Evol. 5:578-590.

Kirsh N. 2003. Population genetics in Israel in the 1950s. The unconscious internalization of ideology. Isis. 94:631-655.

Koestler A. 1976. The thirteenth tribe : the Khazar empire and its heritage. New York: Random House.

Ostrer H. 2012. Legacy: A Genetic History of the Jewish People. OUP USA.

Ostrer H. 2001. A genetic profile of contemporary Jewish populations. Nat. Rev. Genet. 2:891-898.

Sand S. 2009. The invention of the Jewish people. London: Verso.

Straten JV. 2011. The origin of Ashkenazi Jewry : the controversy unraveled. New York: Walter de Gruyter.

Van Straten J. 2003. Jewish Migrations from Germany to Poland: the Rhineland Hypothesis Revisited. Mankind Quarterly. 44:367-384.

Van Straten J. 2007. Early Modern Polish Jewry The Rhineland Hypothesis Revisited. Hist. Methods. 40:39-50.

T. Ryan Gregory Starts ‘Evolution Consulting Service’ after Reading Gibbon Paper in Nature

T. Ryan Gregory, who has been fighting ENCODE’s junk science since 2007, is fed up with another high-profile paper published in Nature. He wrote on Twitter –



The paper in question has over seventy authors, which tends to scare us these days. More authors mean a higher chance of ‘faddish stuff’ or ‘fashionable ideas’ getting promoted to the abstract, and a higher risk of being humiliated by Dan Graur sooner or later.

Gibbon genome and the fast karyotype evolution of small apes

Gibbons are small arboreal apes that display an accelerated rate of evolutionary chromosomal rearrangement and occupy a key node in the primate phylogeny between Old World monkeys and great apes. Here we present the assembly and analysis of a northern white-cheeked gibbon (Nomascus leucogenys) genome. We describe the propensity for a gibbon-specific retrotransposon (LAVA) to insert into chromosome segregation genes and alter transcription by providing a premature termination site, suggesting a possible molecular mechanism for the genome plasticity of the gibbon lineage. We further show that the gibbon genera (Nomascus, Hylobates, Hoolock and Symphalangus) experienced a near-instantaneous radiation ~5 million years ago, coincident with major geographical changes in southeast Asia that caused cycles of habitat compression and expansion. Finally, we identify signatures of positive selection in genes important for forelimb development (TBX5) and connective tissues (COL1A1) that may have been involved in the adaptation of gibbons to their arboreal habitat.

We asked Ryan by email what he saw wrong in the paper, and he was kind enough to explain. Ken Weiss and co-authors often touch on similar flaws in evolutionary thinking in their interesting blog.

It’s just very basic “tree thinking”, or correctly interpreting a phylogeny. In the case of the gibbon paper, they start off with the following statement:

“In the primate phylogeny, gibbons diverged between Old World monkeys and great apes, providing a unique perspective from which to study the origins of hominoid characteristics.”

This is nonsensical. If you look at a phylogeny of primates, the Old World monkeys are the outgroup to the clade that includes gibbons and great apes. This means: a) any member of that clade is equally closely related to Old World monkeys — gibbons are not closer than humans are to Old World monkeys, b) both the great apes and gibbons are descended from a common ancestor that is not shared by Old World monkeys — it is therefore equally (in)accurate to say that great apes diverged between Old world monkeys and gibbons. Also, these are all modern species. The fact that gibbons are the outgroup to great apes absolutely does not imply that they are similar to the ancestor of the entire clade. Early branching does not equal primitive.
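Gregory's point can be made concrete with a toy tree. In the sketch below (our own illustration, with a simplified topology and made-up node names), the most recent common ancestor of Old World monkeys and any hominoid — gibbon or human — is the same node, so no hominoid is "closer" to Old World monkeys than another:

```python
# Toy primate phylogeny as child -> parent pointers.
# Topology: (OWM, (gibbon, (orangutan, (gorilla, (chimp, human)))))
parent = {
    "human": "HC", "chimp": "HC",
    "HC": "GHC", "gorilla": "GHC",
    "GHC": "GreatApes", "orangutan": "GreatApes",
    "GreatApes": "Hominoids", "gibbon": "Hominoids",
    "Hominoids": "Catarrhines", "OWM": "Catarrhines",
}

def ancestors(node):
    """Path from a node up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def mrca(a, b):
    """Most recent common ancestor of two tips."""
    anc_a = set(ancestors(a))
    for node in ancestors(b):     # walk up from b; first hit is the MRCA
        if node in anc_a:
            return node

# Every hominoid is *equally* related to Old World monkeys:
assert mrca("OWM", "gibbon") == mrca("OWM", "human") == "Catarrhines"
```

Relatedness on a tree is determined by the depth of the shared ancestor, not by left-to-right position in a drawing — which is exactly why "diverged between Old World monkeys and great apes" is not a meaningful statement.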

If you are not willing to pay the measly sum Ryan is charging, he offers a free DIY solution – read his 2008 paper.

Understanding Evolutionary Trees

Charles Darwin sketched his first evolutionary tree in 1837, and trees have remained a central metaphor in evolutionary biology up to the present. Today, phylogenetics—the science of constructing and evaluating hypotheses about historical patterns of descent in the form of evolutionary trees—has become pervasive within and increasingly outside evolutionary biology. Fostering skills in “tree thinking” is therefore a critical component of biological education. Conversely, misconceptions about evolutionary trees can be very detrimental to one’s understanding of the patterns and processes that have occurred in the history of life. This paper provides a basic introduction to evolutionary trees, including some guidelines for how and how not to read them. Ten of the most common misconceptions about evolutionary trees and their implications for understanding evolution are addressed.

E. coli Nanopore Data Release and #baltiandbioinformatics Nanopore Conference

Nick Loman uploaded his E. coli nanopore data on GigaDB.

Here we present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION device with R7 chemistry during the early-access MinION Access Program (MAP).

Three sequencing runs of the MinION were performed using R7 chemistry. The first run produced 43,656 forward reads (272Mb), 23,338 reverse reads (125Mb), and 20,087 two-direction (2D) reads (131Mb), of which 8% (10Mb) were full 2D. A full 2D read means that the complementary strand was successfully slowed through the pore.

The R7 protocol has a modification to increase the relative number of full 2D reads (NONI). To exploit this, two new libraries were produced which included an overnight incubation stage (ONI-1 and ONI-2). Each library was run on an individual flow cell. This resulted in 6,534 & 8,260 forward reads, 2,171 & 2,945 reverse reads, and 1,740 & 2,394 2D reads (27% & 29%). In this case, 50% and 41.8% of reads were full 2D, respectively. The mean fragment lengths for 2D reads from the three libraries were 6,543 (NONI), 6,907 (ONI-1) and 6,434 (ONI-2).

We have taken the reads, created a kmer distribution, and then merged that distribution with the kmers from the E. coli reference genome (MG1655-K12.fasta). The kmer distribution file for reads can be downloaded from here, and the genome map can be downloaded from here. A snapshot is shown below:

(column 1: 12-mer, column 2: genomic location, column 3: redundancy in genome, column 4: kmer count in Nick Loman’s data)
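A table with those four columns can be built along the following lines. This is a minimal sketch of the idea with made-up toy sequences (FASTA/FASTQ parsing and the actual data files are omitted); the 12-mer size follows the text:

```python
from collections import Counter

K = 12  # the post uses 12-mers

def kmers(seq, k=K):
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_table(genome, reads, k=K):
    """For every k-mer position in the genome, report that k-mer's
    redundancy within the genome and its occurrence count in the reads."""
    read_counts = Counter()
    for r in reads:
        read_counts.update(kmers(r, k))
    genome_counts = Counter(kmers(genome, k))
    # columns: kmer, genomic position, redundancy in genome, count in reads
    return [(genome[i:i + k], i,
             genome_counts[genome[i:i + k]],
             read_counts[genome[i:i + k]])
            for i in range(len(genome) - k + 1)]

# toy demonstration with made-up sequences
genome = "ACGTACGTTGCAACGT"
reads = ["ACGTACGTTGCA", "GTACGTTGCAAC"]
for row in kmer_table(genome, reads)[:3]:
    print(row)
```

For a real 4.6Mb genome and hundreds of megabases of reads the same logic works, though a dedicated k-mer counter would be far more memory-efficient than a Python Counter.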


Those two files should tell you a lot more about the technology and its potential than many talks, papers and blog posts. So, readers are encouraged to take a look.

A few observations:

1. Good match with genome in most locations

Based on the kmer distribution of the reads, we checked the kmer coverage of the genome and found kmers present from almost all locations of the genome.

2. Poly-A regions are missed.

If point 1 gives you any confidence in the data (or our analysis), you are being fooled. Please note that, given the number of reads and the short length of 12-mers, even completely noisy data are expected to contain almost all kmers from the genome. However, a complete lack of kmers from some parts of the genome is informative and can definitely be ascribed to base-calling errors. We found a few such missing regions, and almost always they had poly-A stretches.
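The missing-region check described above amounts to walking the genome and flagging runs of positions whose k-mer never occurs in the reads. A toy sketch of the idea (not our actual pipeline; we use a tiny k and a simulated poly-A dropout to keep the example small):

```python
def uncovered_runs(genome, read_kmers, k=12):
    """Return (start, end) runs of genome positions whose k-mer is
    absent from the reads -- candidate systematic-error regions."""
    runs, start = [], None
    for i in range(len(genome) - k + 1):
        missing = genome[i:i + k] not in read_kmers
        if missing and start is None:
            start = i                       # run of missing k-mers begins
        elif not missing and start is not None:
            runs.append((start, i))         # run ends
            start = None
    if start is not None:
        runs.append((start, len(genome) - k + 1))
    return runs

genome = "ACGTGCT" + "AAAAAA" + "TCGGCTA"
k = 4  # tiny k for the toy example; the post uses k = 12
all_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
read_kmers = all_kmers - {"A" * k}          # simulate poly-A base-call dropout
print(uncovered_runs(genome, read_kmers, k))  # the poly-A stretch shows up
```

On real data, read_kmers would be the key set of the read k-mer counts, and the reported runs would line up with the poly-A stretches mentioned above.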


Here is the opposite extreme – genomic regions with too many unexpected kmers in the reads.


3. Kmer distribution – read vs ref

If almost every kmer is expected to appear among the reads, how can we tell whether they contain any information? The following figure helps. The dotted line shows the occurrence frequency of all kmers from the reads, and the solid line shows the same for only those kmers that actually appear in the genome. The solid peak is shifted to the right, which means the kmers from the genome appear more frequently in the reads than random kmers.
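That shift can be reproduced in miniature: simulate reads that are part genuine genome fragments and part noise, then compare how often genomic versus non-genomic k-mers occur. This is a toy sketch with a deliberately short k; the real figure used 12-mers on the actual data.

```python
import random
from collections import Counter

K = 5
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(300))

# Simulated reads: overlapping genuine fragments plus a few pure-noise reads
fragments = [genome[i:i + 40] for i in range(0, 260, 10)]
noise = ["".join(random.choice("ACGT") for _ in range(40)) for _ in range(5)]

read_kmers = Counter()
for s in fragments + noise:
    for i in range(len(s) - K + 1):
        read_kmers[s[i:i + K]] += 1

genomic = {genome[i:i + K] for i in range(len(genome) - K + 1)}

def mean_count(kmers):
    """Mean read-occurrence frequency over a collection of k-mers."""
    kmers = list(kmers)
    return sum(read_kmers[k] for k in kmers) / len(kmers)

mean_in = mean_count(k for k in read_kmers if k in genomic)
mean_out = mean_count(k for k in read_kmers if k not in genomic)
```

Genomic k-mers accumulate counts from every fragment covering them, while noise k-mers mostly occur once, so `mean_in` exceeds `mean_out`; that is the rightward shift of the solid peak.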


4. Unusual kmers in the reads, not from the genome

The reads also contain some kmers appearing at high frequency that are not present in the genome at all.


Analyses 1-4 should be helpful in choosing an appropriate aligner rather than shooting blindly. We will add more as we analyze the data in more detail.


On #baltiandbioinformatics Nanopore Conference

1. Very impressive. This is the future !!

‘This’ refers to the Google Hangout way of running conferences, which anyone can attend from his/her bedroom, even Bill Clinton with his cigar :)


Scott Edmunds’s GigaDB, which we covered earlier, turns out to be another tech winner in the publishing world.

2. Nanopore Technology

Two highlights –

a) SPAdes assembled E. coli using nanopore+ILMN and got good results. Note – they needed ILMN and so did Schatz.

b) Error rate figure –


Clive Brown’s talk was very impressive, and he was beaming with infectious optimism. Hopefully Loman’s tests will show how fast that infection spreads :).

His vision is right, but bioinformatics remains the weak point in this project as explained next.

3. Cost of sequencing

If each nanopore stick costs $1000, the cost of sequencing is high compared to PacBio (and Illumina). The mobility nanopore provides is truly revolutionary, but for his vision to work, the price point for the sticks has to be much lower.

We have no idea what the manufacturing costs for those sticks are, but let us assume that he can provide them for $10 a pop. In that case, to maintain a cash flow of $100M or higher (to justify the valuation), data management and analysis have to be the strong point of the company. So, essentially, ONT becomes a bioinformatics company.



Interesting viewpoint of the CEO of a company that wants to pass itself off as a ‘small startup’ from a garage.


Nick Loman and Clive Brown’s Nanopore Presentation – Live Feed

Tune in at 9AM EST here.

Here are some guesses on the technology based on what we are hearing from everywhere. Those could be good questions to ask.

1. Error rate:
This technology possibly has a ~25-30% error rate. We say that based on Michael Schatz’s Twitter comment about the yeast assembly, added in the previous post. Typically, an all-vs-all PacBio assembly takes about 50X coverage, whereas Schatz used 100X for his nanopore-based yeast assembly. We also know from previous talks and discussions that the errors are mostly indels.

However, a high error rate is nothing to be ashamed of, if the technology delivers sufficiently long reads. In fact, this technology may turn out to be a bioinformaticians’ paradise, letting them start all the way from base-calling. Curious readers may check the useful comment by gasstationwithpumps in this post – Kalman Filter – a Way to Reconstruct Signal from Noisy Data (as in Nanopore).

2. Proper alignment program:
Apparently BLAST and BLASR are not appropriate alignment programs, but LASTZ works. How is LASTZ different from the other two programs?

LASTZ was written by R. S. Harris, a PhD student of Webb Miller, who himself was a collaborator of Eugene Myers. During the late 90s, Myers joined Celera, while Miller became a close collaborator of Ross Hardison to work on, yes you heard it right, ENCODE. LASTZ was written to align genomes of organisms as divergent as human and mouse, and the seeding section of the manual written by Harris shows the main difference from BLAST/BLASR.

--seed=12of19 (T=1 or T=2): seeds require a 19-bp word with matches in 12 specific positions (1110100110010101111).
--seed=14of22 (T=3 or T=4): seeds require a 22-bp word with matches in 14 specific positions (1110101100110010101111).
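In these patterns a 1 marks a position that must match and a 0 a free position, which is what lets seeds survive frequent mismatches. A minimal sketch of spaced-seed matching under that interpretation follows; it is not LASTZ's actual code, just the principle, with made-up sequences.

```python
MASK = "1110100110010101111"  # LASTZ's 12-of-19 seed pattern

def seed_key(word, mask=MASK):
    """Keep only the characters at the mask's '1' (must-match) positions."""
    return "".join(c for c, m in zip(word, mask) if m == "1")

def seed_hits(query, target, mask=MASK):
    """All (i, j) position pairs where query and target share a spaced-seed hit."""
    w = len(mask)
    index = {}
    for j in range(len(target) - w + 1):
        index.setdefault(seed_key(target[j:j + w]), []).append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(seed_key(query[i:i + w]), []):
            hits.append((i, j))
    return hits

# A target and a query that disagree at every '0' position of the mask:
target = "ACGTACGTACGTACGTACG"
query = "".join(c if m == "1" else ("T" if c != "T" else "A")
                for c, m in zip(target, MASK))
```

Even though the two sequences mismatch at seven positions, the spaced seed still fires, which is why such seeds tolerate error-rich reads better than exact-word seeding.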

Once again, this points to a higher than expected error rate. Maybe another good program exists for aligning those reads, as a cursory reading of this blog would suggest. Let us leave it at that for the time being :)

3. OmicsOmics Blog Reanalyzes Mikheyev/Tin Data

Major conclusion is in bold.

Summaries of the alignment statistics can be found in Table 2 of Mikheyev & Tin. For BLASR, they aligned 3.4K (12%) of the 1D reads, for 0.65Mb of bases aligned to the lambda reference. This yields a coverage of 13.5X. For 2D, the statistics aren’t much different: 0.9K reads aligned, yielding 0.81Mb of aligned bases. With the LASTAL approach, 8.8K of the 1D reads aligned to yield 33.3Mb of aligned bases. Similarly, for the 2D data, 3.6K reads aligned, yielding 6.6Mb of aligned bases. So the improved alignment procedures increased the amount of aligned 1D data by about 51X and the amount of 2D data by about 8.1X. These numbers are conservative; I kept only the longest alignment for each read, which precludes some remaining cases in which two alignments are made that plausibly represent a central region which failed to yield an alignment (I do apologize for some different numbers I tweeted earlier; I let my excitement get the best of me and the 2D number was quite off).

Full program is here.

The popular Balti and Bioinformatics series returns, this time in virtual space! Join us for a nanopore-themed programme which will be streamed from Google Hangouts to YouTube.

Draft schedule (subject to change)

Times are British Summer Time (GMT+1)

Start: 14:00 BST, 13:00 GMT, 09:00 US Eastern

14.00 – Intro
14.05 – Clive Brown, Nanopore sequencing
14.45 – Nick Loman, Early data from nanopore sequencing: bioinformatics opportunities and challenges
15.15 – Matt Loose, Streaming data solutions for nanopore
15.30 – Josh Quick, Nanopore sequencing in outbreaks
15.45 – Torsten Seemann, Awesome pipelines for microbial genomics
16.15 – Finish

Tutorial on David Tse’s Work on DNA/RNA Assembly Presented at ISIT 2014

We previously covered work from David Tse’s group (Optimal Assembly for High Throughput Shotgun Sequencing) and a number of his other papers are available from arxiv.


They are math-intensive, but the tutorial slides above (click on the image to access them), forwarded by reader Asif, present them in simple language. Here is Asif’s note –

This is a set of tutorial slides presented at the International Symposium on Information Theory a few months ago. They summarize the work from David Tse’s group, including the papers linked below. The papers are somewhat technical, and the problem is complex, so the slides are helpful: they highlight the main results and the approach taken without the burden of being as precise as the papers themselves, while still representing the ideas faithfully.

Long Term Direction of Nanopore Sequencing – Three Ways from Here

The infinite amount of propaganda being spread about Oxford Nanopore really troubles lowly janitors like us. We are not sure why this company and its ‘fanboys’ operate through innuendos passed around social media channels rather than delivering real information, as other respectable companies would do. Hopefully Nick Loman’s presentation tomorrow will be backed by the release of some real data, but until then, scientists’ job is not to cheer-lead for companies but to find out the truth and represent it faithfully and accurately. The scientific community is failing to play its proper role, just as it failed to debunk Ewan Birney’s misleading media campaign about ‘killing the junk DNA’ (@ENCODE_NIH). In fact, in the case of ENCODE, scientists were so married to the propaganda from Birney and friends that even reputed journals attacked Dan Graur for simply telling the truth.

We sincerely hope that the ONT saga is not a repeat of that experience (despite the involvement of disgraced scientist Ewan Birney in both), but let us deconstruct some of the myths surrounding nanopore sequencing.


Myth 1. “Oxford Nanopore is a small company.”

BioMickWatson is trying to spread this myth through a blog post published over the weekend.

According to Bloomberg news, the company had a valuation of $1.5B last year and the valuation must have gone up during last month’s fundraising.
Continue reading Long Term Direction of Nanopore Sequencing – Three Ways from Here

WABIstan Report

The 14th International Workshop on Algorithms in Bioinformatics (WABI) takes place in Wroclaw, Poland, during September 8–10, 2014. A few days back, a reader forwarded us the official link to its proceedings. Thankfully, we managed to download them in time, because Springer has since decided to change the access policy for the proceedings. Here is a brief summary of some of the papers.

The proceedings include 25+ papers. We already wrote about Gene Myers’ submission, but here is the entire table of contents.

QCluster: Extending Alignment-Free Measures with Quality Values for Reads Clustering
Matteo Comin, Andrea Leoni, and Michele Schimd

Abstract. The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems. In this context an important aspect is the reduction of data complexity by collapsing redundant reads in a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on k-mers counts, have been used to cluster reads. Quality scores produced by NGS platforms are fundamental for various analyses of NGS data like reads mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15%). Thus it will be fundamental to exploit quality value information within the alignment-free framework. In this paper we present a family of alignment-free measures, called Dq-type, that incorporate quality value information and k-mers counts for the comparison of reads data. A set of experiments on simulated and real reads data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. These measures are implemented in a software called QCluster.

Improved Approximation for the Maximum Duo-Preservation String Mapping Problem
Nicolas Boria, Adam Kurpisz, Samuli Leppänen, and Monaldo Mastrolilli

A Faster 1.375-Approximation Algorithm for Sorting by Transpositions
Luís Felipe I. Cunha, Luis Antonio B. Kowada, Rodrigo de A. Hausen, and Celina M.H. de Figueiredo

A Generalized Cost Model for DCJ-Indel Sorting
Phillip E.C. Compeau


Efficient Local Alignment Discovery amongst Noisy Long Reads
Gene Myers

Efficient Indexed Alignment of Contigs to Optical Maps
Martin D. Muggli, Simon J. Puglisi, and Christina Boucher

Navigating in a Sea of Repeats in RNA-seq without Drowning
Gustavo Sacomoto, Blerina Sinaimeri, Camille Marchet,
Vincent Miele, Marie-France Sagot, and Vincent Lacroix

Linearization of Median Genomes under DCJ
Shuai Jiang and Max A. Alekseyev

An LP-Rounding Algorithm for Degenerate Primer Design
Yu-Ting Huang and Marek Chrobak

GAML: Genome Assembly by Maximum Likelihood
Vladimír Boža, Broňa Brejová, and Tomáš Vinař

A Common Framework for Linear and Cyclic Multiple Sequence Alignment Problems
Sebastian Will and Peter F. Stadler

Entropic Profiles, Maximal Motifs and the Discovery of Significant Repetitions in Genomic Sequences
Laxmi Parida, Cinzia Pizzi, and Simona E. Rombo

Estimating Evolutionary Distances from Spaced-Word Matches
Burkhard Morgenstern, Binyao Zhu, Sebastian Horwege, and Chris-André Leimeister

On the Family-Free DCJ Distance
Fábio Viduani Martinez, Pedro Feijão, Marília D.V. Braga, and Jens Stoye

New Algorithms for Computing Phylogenetic Biodiversity
Constantinos Tsirogiannis, Brody Sandel, and Adrija Kalvisa

The Divisible Load Balance Problem and Its Application to Phylogenetic Inference
Kassian Kobert, Tomáš Flouri, Andre Aberer, and Alexandros Stamatakis

Multiple Mass Spectrometry Fragmentation Trees Revisited: Boosting Performance and Quality
Kerstin Scheubert, Franziska Hufsky, and Sebastian Böcker

An Online Peak Extraction Algorithm for Ion Mobility Spectrometry Data
Dominik Kopczynski and Sven Rahmann

Best-Fit in Linear Time for Non-generative Population Simulation (Extended Abstract)
Niina Haiminen, Claude Lebreton, and Laxmi Parida


GDNorm: An Improved Poisson Regression Model for Reducing Biases in Hi-C Data
Ei-Wen Yang and Tao Jiang

Pacemaker Partition Identification
Sagi Snir

Abstract. The universally observed conservation of the distribution of evolution rates across the complete sets of orthologous genes in pairs of related genomes can be explained by the model of the Universal Pacemaker (UPM) of genome evolution. Under UPM, the relative evolutionary rates of all genes remain nearly constant whereas the absolute rates can change arbitrarily. It was shown on several taxa groups spanning the entire tree of life that the UPM model describes the evolutionary process better than the traditional molecular clock model [26,25]. Here we extend this analysis and ask: how many pacemakers are there and which genes are affected by which pacemakers? The answer to this question induces a partition of the gene set such that all the genes in one part are affected by the same pacemaker. The input to the problem comes with an arbitrary amount of statistical noise, hindering the solution even more. In this work we devise a novel heuristic procedure, relying on statistical and geometrical tools, to solve the pacemaker partition identification problem and demonstrate by simulation that this approach can cope satisfactorily with considerable noise and realistic problem sizes. We applied this procedure to a set of over 2000 genes in 100 prokaryotes and demonstrated the significant existence of two pacemakers.

Manifold de Bruijn Graphs
Yu Lin and Pavel A. Pevzner

Abstract. Genome assembly is usually abstracted as the problem of reconstructing a string from a set of its k-mers. This abstraction naturally leads to the classical de Bruijn graph approach—the key algorithmic technique in genome assembly. While each vertex in this approach is labeled by a string of the fixed length k, recent genome assembly studies suggest that it would be useful to generalize the notion of the de Bruijn graph to the case when vertices are labeled by strings of variable lengths. Ideally, we would like to choose larger values of k in high-coverage regions to reduce repeat collapsing and smaller values of k in low-coverage regions to avoid fragmentation of the de Bruijn graph. To address this challenge, the iterative de Bruijn graph assembly (IDBA) approach allows one to increase k at each iteration of the graph construction. We introduce the Manifold de Bruijn (M-Bruijn) graph (which generalizes the concept of the de Bruijn graph) and show that it can provide benefits similar to the IDBA approach in a single iteration that considers the entire range of possible k-mer sizes rather than varying k from one iteration to another.

Constructing String Graphs in External Memory
Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola,
Marco Previtali, and Raffaella Rizzi

Abstract. In this paper we present an efficient external memory algorithm to compute the string graph from a collection of reads, which is a fundamental data representation used for sequence assembly. Our algorithm builds upon some recent results on lightweight Burrows-Wheeler Transform (BWT) and Longest Common Prefix (LCP) construction providing, as a by-product, an efficient procedure to extend intervals of the BWT that could be of independent interest. We have implemented our algorithm and compared its efficiency against SGA—the most advanced assembly string graph construction program.

Topology-Driven Trajectory Synthesis with an Example on Retinal Cell Motions
Chen Gu, Leonidas Guibas, and Michael Kerber

Abstract. We design a probabilistic trajectory synthesis algorithm for generating time-varying sequences of geometric configuration data. The algorithm takes a set of observed samples (each may come from a different trajectory) and simulates the dynamic evolution of the patterns in O(n2 log n) time. To synthesize geometric configurations with indistinct identities, we use the pair correlation function to summarize point distribution, and α-shapes to maintain topological shape features based on a fast persistence matching approach. We apply our method to build a computational model for the geometric transformation of the cone mosaic in retinitis pigmentosa — an inherited and currently untreatable retinal degeneration.
A Graph Modification Approach for Finding Core–Periphery Structures in Protein Interaction Networks
Sharon Bruckner, Falk Hüffner, and Christian Komusiewicz

Abstract. The core–periphery model for protein interaction (PPI) networks assumes that protein complexes in these networks consist of a dense core and a possibly sparse periphery that is adjacent to vertices in the core of the complex. In this work, we aim at uncovering a global core–periphery structure for a given PPI network. We propose two exact graph-theoretic formulations for this task, which aim to fit the input network to a hypothetical ground truth network by a minimum number of edge modifications. In one model each cluster has its own periphery, and in the other the periphery is shared. We first analyze both models from a theoretical point of view, showing their NP-hardness. Then, we devise efficient exact and heuristic algorithms for both models and finally perform an evaluation on subnetworks of the S. cerevisiae PPI network.

Interpretable Per Case Weighted Ensemble Method for Cancer Associations
Adrin Jalali and Nico Pfeifer

Abstract. Over the past decades, biology has transformed into a high-throughput research field, both in terms of the number of different measurement techniques as well as the amount of variables measured by each technique (e.g., from Sanger sequencing to deep sequencing), and is more and more targeted to individual cells [3]. This has led to an unprecedented growth of biological information. Consequently, techniques that can help researchers find the important insights of the data are becoming more and more important. Molecular measurements from cancer patients such as gene expression and DNA methylation are usually very noisy. Furthermore, cancer types can be very heterogeneous. Therefore, one of the main assumptions for machine learning, that the underlying unknown distribution is the same for all samples in training and test data, might not be completely fulfilled.


Reconstructing Mutational History in Multiply Sampled Tumors Using Perfect Phylogeny Mixtures
Iman Hajirasouliha and Benjamin J. Raphael

Abstract. High-throughput sequencing of cancer genomes has motivated the problem of inferring the ancestral history of somatic mutations that accumulate in cells during cancer progression. While the somatic mutation process in cancer cells meets the requirements of the classic Perfect Phylogeny problem, nearly all cancer sequencing studies do not sequence single cancerous cells, but rather thousands to millions of cells in a tumor sample. In this paper, we formulate the Perfect Phylogeny Mixture problem of inferring a perfect phylogeny given somatic mutation data from multiple tumor samples, each of which is a superposition of cells, or “species.” We prove that the Perfect Phylogeny Mixture problem is NP-hard, using a reduction from the graph coloring problem. Finally, we derive an algorithm to solve the problem.

We will cover a few papers in future commentaries, but here is a comment about WABIstan in general. We do not understand why its organizers decided to turn it into a ‘land-locked country’ by removing all aspects of asking biological questions from the definition of bioinformatics. If bioinformatics were redefined as asking questions about how living systems function using computers, the conference (and the field of bioinformatics) would benefit.

Open Science Fail – Oxford Nanopore Kicks Out Alex Mikheyev from its MAP Program

Yesterday we wrote about an initial report titled “A first look at the Oxford Nanopore MinION Sequencer”. The author detailed various steps of nanopore sequencing and presented results from his samples. On Twitter, some criticized his bioinformatics methods, arguing that the alignment tools he used were responsible for the high percentage of misaligned reads he observed. Thankfully, the author also uploaded all his raw data to GitHub for anyone else to check.

We thought that was a triumph of open science, where any apparently incorrect observation would be quickly corrected by the community. We have yet to see anyone redo the alignments with the suggested proper procedures; instead, the saga has taken a different turn. Dr. Mikheyev wrote to us that he was kicked out of the ONT MAP program with the following message.

We therefore can only conclude that your objectives and outlook are fundamentally misaligned with those of the MAP programme and the other participants.

Not sure whether it is intentional, but the message seems to add another dimension to the ‘misalignment’ issues. We contacted Oxford Nanopore to confirm the veracity of the above account, but have not heard back yet.



Here are a few clarifications on the above issue, based on relevant twitter discussions.




Merging of Multi-String BWTs with Applications

Code site

Msbwt is a package for combining strings from sequencing into a data structure known as the multi-string Burrows Wheeler Transform (MSBWT). This structure allows for querying a k-mer in O(k) time, regardless of how many strings are present in the MSBWT. This particular package was originally created for merging MSBWTs after construction. In short, this allows multiple datasets to be combined into a single structure which supports queries over both data sets simultaneously.

Included in this package are implementations of MSBWT creation algorithms as described by Bauer et al. in “Lightweight BWT construction for very large string collections” for different file types. Furthermore, the algorithm for merging these MSBWTs from Holt and McMillan in “Merging of Multi-String BWTs with Applications” is also implemented.

Currently, some rudimentary query utilities are also in place on the command line interface. However, we recommend that anyone wishing to perform dynamic/interactive queries use the API provided by the source code.
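The O(k) k-mer query works by FM-index backward search over the BWT. A toy single-string version conveys the idea; this is not msbwt's actual code (its multi-string, indexed implementation is far more involved), and the rank counting here is naive where a real index would use precomputed structures.

```python
def bwt(text):
    """Naive BWT via sorted rotations (fine for short toy inputs)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def count_kmer(bwt_str, pattern):
    """Backward search: one interval-narrowing step per pattern symbol.
    With precomputed rank structures each step is O(1), giving O(k) total."""
    # C[c] = number of symbols in the BWT strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)
    lo, hi = 0, len(bwt_str)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt_str[:lo].count(c)
        hi = C[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("mississippi")
```

Note that the pattern length alone bounds the number of interval-narrowing steps, which is why query time is independent of how many strings are in the collection.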

For common usage, please type after installation:

msbwt -h
Detailed Usage
msbwt cffq
Create From FastQ (cffq) will create a MSBWT from a list of supplied FastQ files. This method uses the “column-wise” approach described by Bauer et al. in “Lightweight BWT construction for very large string collections”. The algorithm is not computationally challenging, but requires a significant amount of disk I/O. This command is a combination of two subcommands “pp” and “cfpp”.

-p numProcesses – the number of processes supplied to the algorithm
-u, --uniform – indicates that all reads in the FastQ files have uniform length (faster than non-uniform)
msbwt pp
Pre-Process will perform the first stage of the MSBWT construction algorithm. Specifically, it extracts all reads from a list of FastQ files and stores them in a “column-wise” manner to be processed later. It creates a series of seq files (which can be deleted after creating the full MSBWT) that are used for the “cfpp” function call.

-u, --uniform – indicates that all reads in the FastQ files have uniform length (faster than non-uniform)
msbwt cfpp
Create From Pre-Process is the second stage of the MSBWT construction algorithm. Specifically, it takes the extracted reads (in a different format from “pp” function) and builds the MSBWT in a “column-wise” manner. The final result is a msbwt.npy file in the supplied directory.

-p numProcesses – the number of processes supplied to the algorithm
-u, --uniform – indicates that all reads in the FastQ files have uniform length (faster than non-uniform)
msbwt merge
This is an implementation of the merge algorithm described by Holt and McMillan in “Merging of Multi-String BWTs with Applications”. Specifically, it takes as input multiple MSBWTs and interleaves them together such that the final result includes all strings from the inputs. It also generates an interleave vector used to identify the original source of each string. Currently, the maximum number of inputs is 256.

-p numProcesses – the number of processes supplied to the algorithm
msbwt query
This provides command line access to the most basic of queries that can be provided to a MSBWT. Given a MSBWT and k-mer, it will return the number of times that k-mer occurs in the string collection.

-d, --dump-seqs – Modifies the output to also dump each string containing the k-mer and its dollar ID
msbwt massquery
This provides command line access to the basic k-mer count for a list of provided k-mers. It will take as input a file containing line separated k-mers and return a list of counts. Output is returned in CSV format of k-mer, counts.

-r, --rev-comp – Modifies the output to also return reverse complemented k-mer counts in the CSV
msbwt compress
This will take a MSBWT and compress it using a rudimentary run-length encoding. The compressed structure can still be searched, but auxiliary structures will have to be built (automatically) before the searches take place.

-p numProcesses – the number of processes supplied to the algorithm
msbwt decompress
This will take a compressed MSBWT and decompress it back to byte-per-symbol encoding. Auxiliary structures will have to be built (automatically) before the searches take place.

-p numProcesses – the number of processes supplied to the algorithm
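The compression above exploits the fact that a BWT groups identical symbols into long runs, so even a rudimentary run-length encoding pays off. Here is a sketch of the principle; msbwt's actual on-disk format is its own, and the example string is made up.

```python
from itertools import groupby

def rle_encode(s):
    """Run-length encode a string into (symbol, run_length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(s)]

def rle_decode(runs):
    """Inverse of rle_encode: expand the pairs back into a string."""
    return "".join(c * n for c, n in runs)

# A BWT-like string with long homogeneous runs compresses to a few pairs:
bwt_like = "AAAAACCCGGTTTTTTTT$"
runs = rle_encode(bwt_like)
```

Because searching needs rank information over the symbols, the compressed form requires the auxiliary structures the README mentions before queries can run.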


Motivation: The throughput of genomic sequencing has increased to the point that it is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data, as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.

Results: We present a novel algorithm that merges multi-string BWTs in O(LCS*N) time, where LCS is the length of the longest common substring between any of the inputs and N is the total length of all inputs combined (number of symbols), using O(N * log2(F)) bits, where F is the number of multi-string BWTs merged. This merged multi-string BWT is also shown to have a higher compressibility compared to the input multi-string BWTs separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications.

Availability: The MSBWT package is available through PyPI with source code located at