In DALIGN Paper, Gene Myers Delivers a Major Blow to His Biggest Competitor

While Gene Myers was busy working on neural images of fruit flies, he noticed the bioinformatics landscape being taken over by search algorithms originating from a ‘competitor’. Now he comes back with an outstanding paper (DALIGN) to show why he is still the master.

That competitor of Gene Myers is none other than Gene Myers, 20 years younger. In the early 90s, he and Udi Manber (currently leading the search group at Google) introduced the concept of suffix arrays. Fast forward two decades, and nearly every nucleotide search program for the new sequencing technologies uses suffix arrays combined with the Burrows-Wheeler transform (BWT) and FM-indices. We covered that history in an earlier commentary (check this blog post from 2011).

Myers tried to use BLASR, which incorporates suffix arrays, for aligning PacBio reads and found it to be too slow. The trouble comes from memory bandwidth and poor cache locality. Modern processors have three levels of cache (L1, L2, L3), and the speed difference between having all data in the L1 cache and fetching it from DRAM can be 100x. A BWT-based FM-index makes effectively random accesses to memory, so the search becomes highly dependent on DRAM latency. Especially when one tries to scale BWT-based alignment programs across multiple cores, memory bandwidth becomes the major bottleneck. Our readers are well familiar with that drawback (check “The Future of Computers – Multicore and the Memory Wall”).

The algorithm proposed by Myers overcomes that bottleneck in two steps. They are so elegant that even the memory-heavy part of his algorithm sees a 3.6x scaleup on a 4-core machine, while the less memory-intensive part sees a 3.88x scaleup. The steps are – (i) finding read pairs highly likely to align, based on matching seeds, and (ii) computing the alignments. Even though they sound innocuous, the beauty is in the details. The memory-intensive rapid seed detection step introduces a highly cache-coherent radix sort that attempts to keep the relevant data in the L1 cache. The rapid local alignment step uses Myers’ own O(nd) algorithm from the mid-80s (see “How Does the Unix Diff Algorithm Work?”), but with several optimizations to limit the rapidly expanding wave of the search. In particular, it exploits the random distribution of errors in PacBio reads.
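To make the radix-sort idea concrete, here is a toy Python sketch of a least-significant-digit radix sort over integer-encoded k-mer seeds. This is our own illustration of why sorting seeds is cache-friendlier than FM-index lookups (each pass streams sequentially through the data); it is not Myers’ tuned implementation, and the key width and digit size below are arbitrary choices.

def radix_sort(keys, key_bits=42, digit_bits=8):
    # LSD radix sort: sort non-negative integer keys (e.g. 2-bit-encoded 21-mers,
    # which fit in 42 bits) one digit_bits-wide digit at a time.
    mask = (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        buckets = [[] for _ in range(1 << digit_bits)]
        for x in keys:                          # sequential pass over the keys
            buckets[(x >> shift) & mask].append(x)
        keys = [x for b in buckets for x in b]  # concatenate buckets (stable)
    return keys

print(radix_sort([9, 3, 255, 0, 1024, 7]))      # [0, 3, 7, 9, 255, 1024]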

What next after this formidable accomplishment of speeding up PacBio alignment 25X over BLASR and thus making large assemblies possible? There are rumors that Gene Myers plans to demonstrate one or more of the following three – (i) walk on water across the river Elbe, (ii) design a perfect assembler, (iii) finish the zebrafish genome. Given how difficult (ii) and (iii) have turned out to be so far, we believe he will go for (i).

If, instead, Myers plans to finish Danio and thousands of other recently published incomplete genomes, he needs to hurry up. We heard that Udi Manber, who is currently at Google, is working on a different solution for those bad genomes. Google plans to replace all “NNNNNN” regions with targeted ads, and there is a possibility that Google will buy a short-read company, because the more gaps, the better.

>chr1
atgtcgcggcgcaatgggaggagatttaaaNNNNNNN.....Get..2%interest...rate...for..your...mortgage.........NNNNNNNttccttctttaaattattaggtgcgcgctagcgatcgcggtattatataNNNNNNNNNNNNN.....Domino's....Free...Pizza....Delivery.........NNNNNNNccattgccaggtccggcgctaggcgctagcgcgctaggcgcgcgcccctatgagga

The full article by Myers will appear in WABI 2014 (Workshop on Algorithms in Bioinformatics), Sept 8-10, Wroclaw, Poland. In the meantime, readers interested in learning more about these developments can check his blog. The code of DALIGNER is available from GitHub.

Genome Assembly – Discussion on Standardization of How to Represent Assembled ‘Genome’

A few days back, Heng Li opened up a new discussion on -

A proposal of the Graphical Fragment Assembly format

Introduction
Almost three years ago, there was a lengthy discussion in the Assemblathon mailing list about a generic format for fragment assembly. The end product was the FASTG format. In the discussion, I expressed several major concerns with the format. The top one is that it is mathematically wrong. Three years later, FASTG is still not widely used. At this point, Adam Phillippy and Pall Melsted openly called for a generic assembly format again. I also feel the pressing necessity of standardization, so I decided to give it a try myself. This is the Graphical Fragment Assembly format, or GFA in abbreviation.

In this post, I will start from the theoretical basis of assembly graphs, describe the format and finally discuss potential issues with the proposal.

I showed an earlier version of this format to Richard Durbin, Daniel Zerbino and Benedict Paten last night in Oxford. That version was a variant of FASTA. When I was formalizing the format in this post, I found FASTA to be too crowded and too limited. Following the suggestion of Daniel, I finally adopted a format similar to ASQG and the PSMC output.

Theory
DNA sequence assembly is often (though not always) represented as a graph. There are multiple types of graphs, including the de Bruijn graph, overlap graph, unitig graph and string graph. They are all bidirected graphs. Briefly, in such a graph, each vertex is a sequence and each arc is an overlap. Because DNA sequences have two strands, an arc may have four directions, representing the four possible overlaps: forward-forward, forward-reverse, reverse-forward and reverse-reverse. It should be noted that a k-mer de Bruijn graph is equivalent to an overlap graph for k-mer reads with (k-1)-mer overlaps. It is a bidirected graph, too.

The critical problem with FASTG is that it puts sequences on arcs/edges. It is unable to describe a simple topology such as A->B; C->B; C->D without adding a dummy node, which breaks the theoretical elegance of assembly graphs. Due to the historical confusion between vertices and edges, I will avoid using those terms. I will use segment for a piece of sequence and link for a connection between segments.

Also, a follow up post is here -

First update on GFA

I was out of town in the past few days, so I have not been able to focus on GFA. Now I am back at work to give the first update on the format, based on the comments from many people, which I appreciate a lot.

In comparison to my initial proposal, the first and major change is to name segments instead of the ends of segments. This seems to be the consensus so far. Secondly, I am thinking of moving the quality field on the “S” line to an optional tag. Not many assemblers produce quality or per-base read depth. Thirdly, more people prefer to explicitly encode bubbles as multiple segments, rather than inline them in the sequence. I will use explicit bubbles at least in the initial iteration.

[The graph from the previous post, rendered in the updated format, appeared here.]
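As a rough stand-in, here is a small Python sketch that builds the A->B; C->B; C->D topology mentioned earlier from GFA-style “S” (segment) and “L” (link) records. The records and the exact field layout are our own assumptions for illustration, not Heng Li’s actual example.

# Toy GFA-style records (tab-separated); sequences are placeholders with 4 bp overlaps.
gfa_text = (
    "S\tA\tTTTTACGT\n"
    "S\tB\tACGTCCCC\n"
    "S\tC\tGGGGACGT\n"
    "S\tD\tACGTAAAA\n"
    "L\tA\t+\tB\t+\t4M\n"
    "L\tC\t+\tB\t+\t4M\n"
    "L\tC\t+\tD\t+\t4M\n"
)

segments, links = {}, []
for line in gfa_text.splitlines():
    fields = line.split("\t")
    if fields[0] == "S":        # segment: name, sequence
        segments[fields[1]] = fields[2]
    elif fields[0] == "L":      # link: from, from-orient, to, to-orient, overlap
        links.append(tuple(fields[1:6]))

print(len(segments), "segments and", len(links), "links")   # 4 segments and 3 links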

Here is a comment from Jason Chin -

Some thought about Heng Li’s proposal for assembly graph format http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format/

some quick comments.

Is this format trying to represent the raw overlaps, the final assembly graph, or both?

It seems to me that it is more suitable for the first. In my work on representing diploid genome assemblies, I had to do multiple levels of reduction of the graph, starting from the initial string/overlap graph, to simplify the problem. If we are looking at a more reduced assembly, we might have to deal with edges corresponding to unitigs that share the same in- and out-nodes. In this format, such bubble paths (where the differences between them are bigger than small indels) will be on different rows, so the behavior of edges with the same in- and out-nodes should be defined. What I did for the diploid work was to assign a uid to each edge.

Also, I do think the final assembly should avoid bidirected edges; they should be resolved by the assembler. From a pragmatic point of view, they will confuse a lot of biologists.

A number of additional comments and suggestions can be seen in Heng Li’s thread. Please feel free to add.

How Does the Unix Diff Algorithm Work?

A link from stackoverflow says -

The basic algorithm is described in “An O(ND) Difference Algorithm and its Variations”, Eugene W. Myers, ‘Algorithmica’ Vol. 1 No. 2, 1986, pp. 251-266; and in “A File Comparison Program”, Webb Miller and Eugene W. Myers, ‘Software–Practice and Experience’ Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in “Algorithms for Approximate String Matching”, E. Ukkonen, `Information and Control’ Vol. 64, 1985, pp. 100-118.

We checked Myers’ original paper and found an elegant graph-based greedy algorithm.

The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N + D^2) expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(N lg N + D^2) time variation.
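To make the idea concrete, here is a minimal Python sketch of the greedy furthest-reaching computation that returns D, the length of the shortest edit script (no traceback). It is our own illustration of the algorithm described in the abstract, not code from the paper.

def shortest_edit_script_length(a, b):
    # Greedy O(ND) algorithm: v[k] holds the furthest-reaching x on diagonal k,
    # where a diagonal is defined by k = x - y.
    n, m = len(a), len(b)
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v.get(k - 1, 0) < v.get(k + 1, 0)):
                x = v.get(k + 1, 0)        # step down: insertion from b
            else:
                x = v.get(k - 1, 0) + 1    # step right: deletion from a
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1        # follow the "snake" of matches for free
            v[k] = x
            if x >= n and y >= m:
                return d                   # d edits suffice
    return n + m

print(shortest_edit_script_length("ABCABBA", "CBABAC"))   # 5, the example used in the paper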

The paper is well written, but those looking for additional explanation may find the following websites helpful -

Investigating Myers’ diff algorithm: Part 1 of 2

Investigating Myers’ Diff Algorithm: Part 2 of 2

In bioinformatics sequence comparison, can we start doing ‘diff’ between sequences instead of going through other elaborate search programs? For example, could we take a million long PacBio reads and run ‘diff’ between pairs to see whether they are closely related?

More on that in a later commentary.

Insane Decision of the Day – Broad Institute Gets $650M to Cure Insanity !!

Over the last few months, I have been working on a rebuttal paper in the field of genetics + psychology and found an unbelievable number of garbage papers from well-known universities. After carefully pondering what was going on, I came to the conclusion that too much money has been chasing too few ideas in the field. Abundance of money is the root cause of the irreproducibility of much of new-era science. The areas of research flush with stupid money are seeing the highest retraction rates.

Now to make things worse -

$650M gift to Broad seeks to propel psychiatric research

The Broad Institute today announced an unprecedented commitment of $650 million from philanthropist Ted Stanley aimed at galvanizing scientific research on psychiatric disorders and bringing new treatments based on molecular understanding to hundreds of millions of people around the world.

The Stanley commitment — the largest ever in psychiatric research and among the largest for scientific research in general — will support research by a collaborative network of researchers within the Stanley Center for Psychiatric Research at the Broad Institute, a biomedical research institution that brings together faculty from the Massachusetts Institute of Technology, Harvard University, the Harvard-affiliated hospitals, and collaborators worldwide.

We are back to the old medieval alchemy mentality: “The defining objectives of alchemy are varied but historically have typically included one or more of the following goals: the creation of the fabled philosopher’s stone; the ability to transform base metals into the noble metals (gold or silver); and development of an elixir of life, which would confer youth and longevity.”

Those looking for a variety of opinions can check the comment section of the ‘In the Pipeline’ blog post covering this topic -
The Broad Gets $650 Million For Psychiatric Research

‘Epigenetics’ Explained, in the Context of Descendants of Holocaust and Slavery Victims

A renowned psychologist asked us by email:

It has been posited that the trauma involved in genocide has altered gene expression not only of Holocaust survivors (as well as descendants of African slaves and conquered Native Americans) but of their descendants. Is there any theoretical or empirical basis for that proposal?

We are posting the response here in case others come across the same claim. Please feel free to point out any error or disagreement in the comment section.

Our response:

In 2010, Drs. Davidson, Hobert and Ptashne published a letter in Nature titled “Questions over the scientific basis of epigenome project”. In addition, Dr. Mark Ptashne published two articles -

Epigenetics: Core misconcept

[Image from the paper]

Faddish Stuff: Epigenetics and the Inheritance of Acquired Characteristics

(The above image from the paper is a beautiful illustration of the idiocy behind the concept of epigenetics and the inheritance of acquired characteristics.)

You will find those articles helpful regarding your question. Needless to add, Professors Davidson, Hobert and Ptashne are all highly respected researchers in genetics.

To give a very brief answer to your question – “is there any theoretical or empirical basis for that proposal?” – no, definitely not in humans or in any organism remotely similar to humans. I sent Dr. Hobert your question and he also said – “Allow me to add that the claim that you cite is idiotic IF it’s put in the context of histone modifications.”

Here is the long reply.

The word ‘epigenetics’ was introduced by biologist C. H. Waddington in the 1940s to explain how cells maintain their states during development.

For example, muscle cells, once differentiated, continue to divide into muscle cells, and the same is true for other kinds of cells, even though they all start from one universal mother cell and carry the same chromosomes after division. How does a cell get programmed into being a muscle cell or a neuronal cell? Waddington speculated that there must be some ‘epigenetic’ effect, because all cells continue to have the same chromosomes and therefore the same set of genes.

Please note that in 1942 there was no DNA sequencing, and Waddington was trying to explain things in abstract terms. In the following 60+ years, Waddington’s question was fully answered by geneticists and developmental biologists. It was found that cells remember their states (muscle or neuron or kidney, etc.) through the actions of a set of genes called transcription factors acting on other genes through binding sites on the chromosomes. Both Professor Ptashne’s and Professor Davidson’s labs did extensive research to elucidate the mechanisms, and Waddington’s ‘epigenetics’ was explained through their work.

In the meantime, a group of scientists hijacked the term ‘epigenetics’, started to make unsubstantiated claims about DNA methylation and histone modifications, and convinced the NIH to give them large pots of money. The bad claims about ‘epigenetic’ effects that you cited are due to their work, and I think you will find Ptashne’s linked paper informative in that regard.

Regarding your question -

“It has been posited that the trauma involved in genocide has altered gene expression not only of Holocaust survivors (as well as descendants of African slaves and conquered Native Americans) but of their descendants. Is there any theoretical or empirical basis for that proposal?”

I am always open to any new idea, but when I try to confirm such claims by reading the primary documents, I notice several problems -

(i) The memory mechanisms (DNA methylation, histone modification) suggested in the epigenetics papers cannot be transferred from cell to cell. So, even if I entertain the idea that one cell or one group of cells got harmed due to stress, that effect is supposed to go away after the cell(s) divide.

(ii) The above issue becomes even more critical when one tries to make a multi-generational claim. In one’s body, only a small fraction of cells (the germ cells) can carry information to the children. Therefore, there has to be a mechanism for the ‘epigenetic stress due to histone modification’ to be transferred to germ cells in order to be transmitted to the next generation. I do not know of any such mechanism, given that histone methylation itself does not get carried from cell to cell.

(iii) Since I am bothered by such basic issues, I try to check the details of the experiments making bold claims related to epigenetics, and there I find another problem. The experiments tend to be very data-poor. Typical genome-wide association studies are performed on thousands of individuals, and even then the researchers report difficulty finding the genetic basis of complex diseases. In that context, using 20 or 30 mice to make a bold claim about ‘epigenetics’ is laughable.

That brings us to the broader question – is a stress-induced multi-generational effect ever seen anywhere in nature? That is where Dr. Hobert’s lab comes in. They showed starvation-induced transgenerational inheritance in worms -

Starvation-Induced Transgenerational Inheritance of Small RNAs in C. elegans

The mechanism, however, is not methylation or histone acetylation, but something completely different (small RNAs). Nothing like this has been demonstrated in humans, and given the genetic distance between worms and humans, I would like to see real evidence before accepting that such a mechanism is at work in humans.

————————–

For our readers’ convenience, we reproduce the letter to Nature by Davidson, Hobert and Ptashne, because we had difficulty finding it for a while.

Questions over the scientific basis of epigenome project

We were astonished to see two sentences in your Editorial on the International Human Epigenome Consortium (Nature 463, 587; 2010) that seem to disregard principles of gene regulation and of evolutionary and developmental biology that have been established during the past 50 years.

You say: “By 2004, largescale genome projects were already indicating that genome sequences, within and across species, were too similar to be able to explain the diversity of life. It was instead clear that epigenetics — those changes to gene expression caused by chemical modification of DNA and its associated proteins — could explain much about how these similar genetic codes are expressed uniquely in different cells, in different environmental conditions and at different times.”

With respect to ‘epigenomics’, we wish to stress that chromatin ‘marks’ and local chemical modifications of DNA, such as methylation, are the consequences of DNA-sequence-specific interactions of proteins (and RNA) that recruit modifying enzymes to specific targets. They are thus directly dependent on the genomic sequence. Such marks are the effects of sequence-specific regulatory interactions, not the causes of cell-type-specific gene expression. Gene regulation explains in outline, and in many cases in detail, how similar genomes give rise to different organisms. Self-maintaining loops of regulatory gene expression are switched on and off, producing the epigenetic effects that lie at the heart of development. Whereas the protein-coding toolkit is in part similar among species, the remainder of the genome, including almost all the regulatory sequence, is unique to each clade: therein lies the explanation for the diversity of animal life.

A letter signed by eight prominent scientists (not including us), and an associated petition signed by more than 50, went into these matters in greater detail, and expressed serious reservations about the scientific basis of the epigenome project. A modified version of the letter appeared in Science (H. D. Madhani et al. Science 322, 43–44; 2008) — the complete letter can be found at http://madhanilab.ucsf.edu/epigenomics.

Using Mathematical Induction to Design Algorithms

While revisiting the 1980s, thanks to Gene Myers, we came across this elegant 1988 paper by Udi Manber that our readers may find helpful. Manber later wrote a book “Introduction to Algorithms – A Creative Approach”, where he expanded on the same ideas (more on that later).

This article presents a methodology, based on mathematical induction, for approaching the design and the teaching of combinatorial algorithms. While this methodology does not cover all possible ways of designing algorithms it does cover many known techniques. It also provides an elegant intuitive framework for explaining the design of algorithms in more depth. The heart of the methodology lies in an analogy between the intellectual process of proving mathematical theorems and that of designing combinatorial algorithms. We claim that although these two processes serve different purposes and achieve different types of results, they are more similar than it seems. This claim is established here by a series of examples of algorithms, each developed and explained by the use of the methodology. We believe that students can get more motivation, greater depth, and better understanding of algorithms by this methodology.

Mathematical induction is a very powerful proof technique. It usually works as follows. Let T be a theorem that we want to prove. Suppose that T includes a parameter n whose value can be any natural number. Instead of proving directly that T holds for all values of n we prove that (1) T holds for n = 1, and (2) T holds for any n > 1 provided that T holds for n – 1. The first part is usually very simple to prove. Proving the second part is easier in many cases than proving the theorem directly since we can use the assumption that T holds for n – 1. (In some sense we get this assumption for free.) In other words, it is enough to reduce the theorem to one with a smaller value of n, rather than proving it from scratch. We concentrate on this reduction. The same principle holds for algorithms. Induction allows one to concentrate on extending solutions of smaller subproblems to those of larger problems. One starts with an arbitrary instance of the problem at hand, and tries to solve it by using the assumption that the same problem but with smaller size has already been solved. For example, given a sequence of n > 1 numbers to sort (it is trivial to sort one number), we can assume that we already know how to sort n – 1 numbers. Then we can either sort the first n – 1 numbers and insert the nth number in its correct position (which leads to an algorithm called insertion sort), or start by putting the nth number in its final position and then sort the rest (which is called selection sort). We need only to address the operation on the nth number. (Of course, this is not the only way to sort, nor is it the only way to use induction for sorting.)
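As a small illustration of the inductive view (our own sketch, not code from the paper): assume the first n-1 elements can already be sorted, then handle only the nth element.

def sort_by_induction(a):
    # Base case: a list of length <= 1 is already sorted.
    if len(a) <= 1:
        return list(a)
    # Inductive step: assume we know how to sort the first n-1 elements...
    prefix = sort_by_induction(a[:-1])
    # ...then only the nth element needs to be placed, i.e. insertion sort.
    last = a[-1]
    i = len(prefix)
    while i > 0 and prefix[i - 1] > last:
        i -= 1
    return prefix[:i] + [last] + prefix[i:]

print(sort_by_induction([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]

Putting the nth element in its final position first and then sorting the rest would, as the paper notes, give selection sort instead.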

Speaking of Manber’s book, here is a useful review by a reader at Amazon. We emphasized the most important part.

This book is much more than a catalog of algorithms (e.g., CLR): its purpose is to train your intuition to recognize mathematical structure in abstract problems. What does it matter if you know Dijkstra’s algorithm? It’s much more valuable to have good intuitions and an inductive reasoning tool chest with which to smash apart all of the variations of the shortest path problem (for example.)

The reviewers who wrote that the book “assumes you are a math wiz” and that it provides “little or no guidance for solving an arbitrary problem of the same type” didn’t get it. This book is trying very hard to make you into a wiz by forcing you to really interact with mathematics, rather than working through a set of nearly identical problems (–what passes for “education” in North America.)

I was just going to leave my review at that, but since the reviews that people find “helpful” are so way off base, I think I should throw in a relevant story.

When my friend was in grade 11, he showed up to the Canadian Computing Competition finals, placing 14th. The guy who won told him, “if you want to win, read this book.” Two years later, he won the CCC against 2000 other students. This book is the best introduction you can give a budding mathematician.

Sure, you can cough up what you’ve memorized from CLR during your university algorithms course. But, do you want to learn to invent algorithms yourself?

Math is not something handed down generations in big, dry textbooks. Like all knowledge, math is organically discovered Truth, and you have to learn to discover it for yourself.

Minia’s New Home – GATB (Genome Assembly & Analysis Tool Box)

The readers of our blog should be very familiar with parts of this project through our earlier introductions to the Minia assembler and the DSK k-mer counter. Minia uses the same idea as diginorm (Bloom filters), but builds an entire assembler with it. Now Rayan Chikhi, Guillaume Rizk, Dominique Lavenier and their collaborators have converted those programs into an entire library with useful modules.

Link

Motivation: Efficient and fast NGS algorithms are essential to analyze the terabytes of data generated by the next generation sequencing machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation.

Results: We propose an open-source library dedicated to genome assembly and analysis to fasten the process of developing efficient software. The library is based on a recent optimized de-Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with very low memory footprints.

Availability and Implementation: The GATB library is written in C++ and is available at the following web site http://gatb.inria.fr under the A-GPL license.

The New K-mer Counter and a Genome Assembly – (ii)

In the last commentary, we showed a few examples of doing k-mer counting with the new super-efficient program – kmc 2. This commentary will provide the context. Readers unfamiliar with how to interpret k-mer distributions of short reads are also encouraged to go through an old blog post – Maximizing Utility of Available RAMs in K-mer World.

Here is our story. We started assembling the two sets of Illumina libraries discussed in the previous commentary using Minia and noticed something odd. At first, we assembled set2 (see the previous commentary for reference) at k=63 using Minia. The total size of all assembled contigs came to around 700Mb with N50=982. So far, so good.

Another bioinformatician helped us assemble set1 and set2 together with Minia. Since the depth of sequencing was very high, we chose to do this assembly at k=73. The total assembly size came out roughly the same, but, quite puzzlingly, N50 dropped to 219!

To figure out what was going on, we took set2 and did k-mer counting at k=63. The distribution came out looking usual (unimodal with a peak around 23, ignoring the noise peak near count=1). As explained in the earlier commentary, the location of the mode of the distribution is attributable to the depth of sequencing and the amount of repeats in the genome. The peak at 23 was somewhat low given how deeply this genome was sequenced (~50 was expected), but we also knew that this organism had extensive duplication. Therefore, there was nothing odd with the k-mer distribution as such.
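As a reminder of the standard bookkeeping (not specific to this data set): with base coverage C and read length L, the expected k-mer coverage is roughly C*(L-k+1)/L, so large k values push the peak well below the base coverage. The coverage value in the snippet below is a made-up number, just to show the size of the effect.

# expected k-mer coverage as a function of k, for 100 nt reads
C, L = 60.0, 100                 # hypothetical base coverage and read length
for k in (21, 33, 45, 63):
    print(k, round(C * (L - k + 1) / L, 1))
# k=21 -> 48.0, k=33 -> 40.8, k=45 -> 33.6, k=63 -> 22.8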

—————————————————–

Armed with the new k-mer counter, we started doing forensic analysis on the libraries, and here is what we found. The following three charts show k-mer distributions for k=21. In particular, pay attention to the third chart below.

[Charts: k-mer distributions at k=21 for the first set, the second set, and both sets combined]

The chart above showing the k-mer distribution of the combined libraries has two distinct peaks: one at the expected location (~120) given the coverage, and the other at half of that value. This pattern very likely indicates a high degree of heterozygosity in our sample. The two copies of each chromosome in a diploid genome are usually very similar, and therefore the total k-mer coverage is unimodal. When the differences between the two copies increase, a second peak appears at half the coverage of the primary peak.

We first thought that the two PE libraries (set1 and set2) were prepared with samples from two different animals, but our collaborator confirmed that they were from the same animal. In fact, the k-mer distribution from each set alone also shows two separate peaks, even though those peaks are somewhat less pronounced than when the sets are combined.

————————————————————–

When we plotted the distribution at increasing k-values, quite an interesting picture emerged. First check the following charts and then see the bottom of the post for our interpretation.

[Charts: k-mer distributions at k=27 for the first set, the second set, and both sets combined]

————————————————————–

[Charts: k-mer distributions at k=33 for the first set, the second set, and both sets combined]

————————————————————–

[Charts: k-mer distributions at k=39 for the first set, the second set, and both sets combined]

————————————————————–

[Charts: k-mer distributions at k=45 for the first set, the second set, and both sets combined]

————————————————————–

[Charts: k-mer distributions at k=51 for the first set and the second set]

————————————————————–

[Chart: k-mer distribution at k=63 for both sets combined]

————————————————————–

What is going on? We suspect the differences between the two chromosome copies occur in many short patches spread all over the genome. We can also guess that their typical spacing is around 40 bases (27 + 27/2). So, when we take the k-mer distribution at k=21, a large number of k-mers from the two chromosome copies are identical, making the peak at higher coverage taller than the peak at lower coverage. As k is increased from 21 to 27, 33, 39, 45, 51 and finally 63, the two copies start to differ everywhere, leaving very few identical k-mers shared by both. Therefore, the peak at the higher value disappears by k=63.
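A back-of-the-envelope check of that guess (our own sketch, and a deliberately crude one: it assumes the heterozygous differences are evenly spaced every d bases, which real data certainly are not):

# If differences between the two chromosome copies occurred exactly every d bases,
# a k-mer would be identical on both copies only if it fits entirely between two
# differences, i.e. roughly a fraction max(0, 1 - k/d) of all positions.
d = 40                                   # guessed spacing between differences
for k in (21, 27, 33, 39, 45, 51, 63):
    print(k, round(max(0.0, 1 - k / d), 2))
# the shared fraction drops from about 0.48 at k=21 to 0 by k=45

The shared fraction shrinks quickly as k grows, which at least qualitatively matches the way the higher-coverage peak fades in the charts above; with randomly placed differences the decay would be more gradual.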

That would certainly be very odd and these charts will keep us awake all night.

Other interpretations are welcome! Also, please let me know if the descriptions above of what we are doing are not sufficient.

The New K-mer Counter and a Genome Assembly – (i)

Now that Sebastian Deorowicz et al. have published a damn good k-mer counter, let us put it to some real work. The first half of this two-part series will present some benchmarks of the program, and the second half will discuss our use of it on a real-life genome assembly problem.

Installation

The installation of kmc 2 is fairly straightforward. We went to their website, downloaded the source code, compiled and installed the program. The compilation needs the Boost library, and if you have it properly installed, everything else should go smoothly to give you two programs – kmc and kmc_dump. They also have pre-compiled binaries, but we downloaded the source code to be able to poke around in the near future.

Description of our project

We have two sets of Illumina PE reads from a vertebrate species that swims. The first set (set1) has 1,588,478,432 100nt reads and the second set (set2) has 1,376,496,476 100nt reads. We did k-mer counting in set1, set2 and set1+set2 for k=21, 27, 33, 39 and 45. We also did k-mer counting for set1 at k=51, set2 at k=51 and set1+set2 at k=63. Yes, that is a lot of k-mer counting.

This was done on a 32-core AMD server with 64-bit Linux, a non-SSD hard drive and 512 GB RAM. The amount of RAM is somewhat irrelevant for this highly efficient program.

Time

I started the k-mer counting tasks around midnight and got all results back by ~7AM. Wow! They were run serially, while the server was also running twenty blastp jobs from another user.

set1 21mer counting took 585.523s.
set2 21mer counting took 567.74s.
set1+set2 21mer counting took 1507.44s.

set1 27mer counting took 612.069s.
set2 27mer counting took 534.832s.
set1+set2 27mer counting took 1361.61s.

set1 33mer counting took 1197.23s.
set2 33mer counting took 1037.95s.
set1+set2 33mer counting took 2440.34s.

We noticed a jump in time needed for k>32, but no big differences after that for k=33, 39, 45 and 51.

Here is the output of the first k-mer counting jobs (set1 and then set2 at k=21), so that you have an idea of the temporary disk space requirements.

1st stage: 258.373s
2nd stage: 327.15s
Total : 585.523s
Tmp size : 129830MB

Stats:
No. of k-mers below min. threshold : 3342104765
No. of k-mers above max. threshold : 0
No. of unique k-mers : 4734755445
No. of unique counted k-mers : 1392650680
Total no. of k-mers : 126120517209
Total no. of reads : 1588478432
Total no. of super-k-mers : 15387969776
1st stage: 252.294s
2nd stage: 315.446s
Total : 567.74s
Tmp size : 112898MB

Command to Run and Output Files

The command kmc can be run either on a single fasta/fastq file or multiple files. The first two commands below are for single files and the third one is for a file named ‘list’, which lists both fasta files.

./kmc -k21 -m200 -fa s200-data.fa out21-200 tmp
./kmc -k21 -m200 -fa s300-data.fa out21-300 tmp
./kmc -k21 -m200 -fa @list out21-both tmp

The program kmc produces two binary output files (out21-200.kmc_pre – 1.1G and out21-200.kmc_suf – 5.2G). Interestingly, across all the k-mer counting runs, the size of the ‘pre’ file varied between 67,174,492 bytes and 1,073,807,452 bytes, and we do not have much of a clue why. This is where having the code will come in handy. One clue – the file size appears to be k-dependent.

The program kmc_dump takes those two binary files and gives you the k-mer counts in a two-column text format. Command to run -

./kmc_dump -cx0 out21-200 o
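From there, a short script turns that two-column output into the k-mer coverage histograms discussed in the companion commentary. A minimal sketch (our own helper, assuming whitespace-separated “kmer count” lines in the output file named ‘o’):

# build a k-mer coverage histogram from kmc_dump output
from collections import Counter

hist = Counter()
with open("o") as fh:
    for line in fh:
        kmer, count = line.split()       # two columns: k-mer sequence, count
        hist[int(count)] += 1

for cov in sorted(hist):                 # coverage, number of distinct k-mers
    print(cov, hist[cov])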

If the above commentary appears rather drab, the next commentary will surely add some color to it.

After SAM and BAM Comes ADAM !

A Berkeley group is proposing a new data format.

Current genomics data formats and processing pipelines are not designed to scale well to large datasets. The current Sequence/Binary Alignment/Map (SAM/BAM) formats were intended for single node processing. There have been attempts to adapt BAM to distributed computing environments, but they see limited scalability past eight nodes. Additionally, due to the lack of an explicit data schema, there are well known incompatibilities between libraries that implement SAM/BAM/Variant Call Format (VCF) data access. To address these problems, we introduce ADAM, a set of formats, APIs, and processing stage implementations for genomic data. ADAM is fully open source under the Apache 2 license, and is implemented on top of Avro and Parquet for data storage. Our reference pipeline is implemented on top of Spark, a high performance in-memory map-reduce system. This combination provides the following advantages: 1) Avro provides explicit data schema access in C/C++/C#, Java/Scala, Python, php, and Ruby; 2) Parquet allows access by database systems like Impala and Shark; and 3) Spark improves performance through in-memory caching and reducing disk I/O. In addition to improving the format’s cross-platform portability, these changes lead to significant performance improvements. On a single node, we are able to speedup sort and duplicate marking by 2×. More importantly, on a 250 Gigabyte (GB) high (60×) coverage human genome, this system achieves a 50× speedup on a 100 node computing cluster (see Table 1), fulfilling the promise of scalability of ADAM.

The ADAM format provides explicit schemas for read and reference oriented (pileup) sequence data, variants, and genotypes. As the schemas are implemented in Apache Avro—a cross-platform/language serialization format—they eliminate the need for the development of language-specific libraries for format decoding/encoding, which eliminates the possibility of library incompatibilities. A key feature of ADAM is that any application that implements the ADAM schema is compatible with ADAM. This is important, as it prevents applications from being locked into a specific tool or pattern. The ADAM stack is inspired by the “narrow waist” of the Internet Protocol (IP) suite (see Figure 2). We consider the explicit use of a schema in this format to be the greatest contribution of the ADAM stack. In addition to the advantages outlined above, ADAM eliminates the file headers in modern genomics formats. All header information is available inside of each individual record. The variant and genotype formats also demonstrate two significant improvements. First, these formats are co-designed so that variant data can be seamlessly calculated from a given collection of sample genotypes. Secondly, these formats are designed to flexibly accommodate annotations without cluttering the core variant/genotype schema. In addition to the benefits described above, ADAM files are up to 25% smaller on disk than compressed BAM files without losing any information.

The ADAM processing pipeline uses Spark as a compute engine and Parquet for data access. Spark is an in-memory MapReduce framework which minimizes I/O accesses. We chose Parquet for data storage as it is an open-source columnar store that is designed for distribution across multiple computers with high compression. Additionally, Parquet supports efficient methods (predicates and projections) for accessing only a specific segment or fields of a file, which can provide significant (2-10×) additional speedup for genomics data access patterns.