
How Does the Unix Diff Algorithm Work?

An answer on Stack Overflow says -

The basic algorithm is described in “An O(ND) Difference Algorithm and its Variations”, Eugene W. Myers, ‘Algorithmica’ Vol. 1 No. 2, 1986, pp. 251-266; and in “A File Comparison Program”, Webb Miller and Eugene W. Myers, ‘Software–Practice and Experience’ Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in “Algorithms for Approximate String Matching”, E. Ukkonen, `Information and Control’ Vol. 64, 1985, pp. 100-118.

We checked Myers’ original paper and found an elegant graph-based greedy algorithm.

The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N+D²) expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(N lg N + D²) time variation.

The paper is well written, but those looking for additional explanation may find the following posts helpful -

Investigating Myers’ diff algorithm: Part 1 of 2

Investigating Myers’ Diff Algorithm: Part 2 of 2
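
For readers who prefer code to prose, here is a minimal Python sketch (ours, not taken from the paper) of the forward pass of Myers’ greedy search. It computes only D, the length of the shortest edit script; the paper also shows how to recover the script itself and how to bring the space requirement down to O(N).

def myers_distance(a, b):
    """Length of the shortest edit script (insertions + deletions) turning a into b,
    following the greedy forward search of Myers' O(ND) algorithm."""
    n, m = len(a), len(b)
    max_d = n + m
    v = {1: 0}  # v[k] = furthest x reached on diagonal k (k = x - y)
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            # decide whether this diagonal is reached by a down-move (insertion)
            # from diagonal k+1 or a right-move (deletion) from diagonal k-1
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]
            else:
                x = v[k - 1] + 1
            y = x - k
            # follow the free "snake" of matching characters along the diagonal
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d

print(myers_distance("ABCABBA", "CBABAC"))  # prints 5, the paper's example sequences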

In bioinformatics sequence comparison, can we start doing ‘diff’ between sequences instead of going through other elaborate search programs? For example, could we take a million long PacBio reads and do ‘diff’ between pairs to see whether they are closely related?

More on that in a later commentary.

Insane Decision of the Day – Broad Institute Gets $650M to Cure Insanity !!

Over the last few months, I have been working on a rebuttal paper in the field of genetics + psychology and found an unbelievable number of garbage papers from well-known universities. After carefully pondering what was going on, I came to the conclusion that too much money had been chasing too few ideas in the field. An abundance of money is the root cause of the irreproducibility of much of the new-era science. The areas of research flush with stupid money are seeing the highest recall rates.

Now to make things worse -

$650M gift to Broad seeks to propel psychiatric research

The Broad Institute today announced an unprecedented commitment of $650 million from philanthropist Ted Stanley aimed at galvanizing scientific research on psychiatric disorders and bringing new treatments based on molecular understanding to hundreds of millions of people around the world.

The Stanley commitment — the largest ever in psychiatric research and among the largest for scientific research in general — will support research by a collaborative network of researchers within the Stanley Center for Psychiatric Research at the Broad Institute, a biomedical research institution that brings together faculty from the Massachusetts Institute of Technology, Harvard University, the Harvard-affiliated hospitals, and collaborators worldwide.

We are back to the old medieval alchemy mentality: “The defining objectives of alchemy are varied but historically have typically included one or more of the following goals: the creation of the fabled philosopher’s stone; the ability to transform base metals into the noble metals (gold or silver); and development of an elixir of life, which would confer youth and longevity.”

Those looking for a variety of opinions can check the comment section of the ‘In the Pipeline’ post covering this topic -
The Broad Gets $650 Million For Psychiatric Research

‘Epigenetics’ Explained, in the Context of Descendants of Holocaust and Slavery Victims

A renowned psychologist asked us by email:

It has been posited that the trauma involved in genocide has altered gene expression not only of Holocaust survivors (as well as descendents of African slaves and conquered Native American) but of their descendents. is there any theoretical or empirical basis for that proposal?

We are posting the response here in case others come across the same claim. Please feel free to point out any error or disagreement in the comment section.

Our response:

In 2010, Drs. Davidson, Hobert and Ptashne published a letter in Nature titled “Questions over the scientific basis of epigenome project”. In addition, Dr. Mark Ptashne published two articles -

Epigenetics: Core misconcept

[Image from the paper]

Faddish Stuff: Epigenetics and the Inheritance of Acquired Characteristics

(The above image from the paper is a beautiful illustration of the idiocy behind the concept of epigenetics and the inheritance of acquired characteristics.)

You will find those articles helpful regarding your question. Needless to add, Professors Davidson, Hobert and Ptashne are all highly respected researchers in genetics.

To give a very brief answer to your question – “is there any theoretical or empirical basis for that proposal?” – no, definitely not in humans or any organism remotely similar to humans. I sent Dr. Hobert your question and he also said – “Allow me to add that the claim that you cite is idiotic IF it’s put in the context of histone modifications.”

Here is the long reply.

The word ‘epigenetics’ was introduced by biologist C. H. Waddington in the 1940s to explain how cells maintain their states during development.

For example, muscle cells, once differentiated, continue to divide into muscle cells, and the same is true for other kinds of cells, even though they all start from one universal mother cell and carry the same chromosomes after division. How does a cell get programmed into being a muscle cell or a neuronal cell? Waddington speculated that there must be some ‘epigenetic’ effect, because all cells continue to have the same chromosomes and therefore the same set of genes.

Please note that in 1942 there were no DNA sequences, and Waddington was trying to explain things in abstract terms. In the following 60+ years, Waddington’s question was fully answered by geneticists and developmental biologists. It was found that cells remember their states (muscle or neuron or kidney, etc.) through the actions of a set of genes called transcription factors acting on other genes through binding sites on the chromosomes. Both Professor Ptashne’s and Professor Davidson’s labs did extensive research to elucidate the mechanisms, and Waddington’s ‘epigenetics’ was explained through their work.

In the meanwhile, a group of scientists hijacked the term ‘epigenetics’, started to make unsubstantiated claims about DNA methylation and histone modifications, and convinced the NIH to give them large pots of money. The bad claims about ‘epigenetic’ effects that you cited are due to their work, and I think you will find Ptashne’s linked paper informative in that regard.

Regarding your question -

“It has been posited that the trauma involved in genocide has altered gene expression not only of Holocaust survivors (as well as descendents of African slaves and conquered Native American) but of their descendents. is there any theoretical or empirical basis for that proposal?”

I am always open to any new idea, but when I try to confirm such claims by reading the primary documents, I notice several problems -

(i) The memory mechanism (DNA methylation, histone modification) suggested in the epigenetics papers cannot be transferred from cell to cell. So, even if I entertain the idea that one cell or one group of cells got harmed due to stress, that effect is supposed to go away after the cell(s) divide.

(ii) The above issue becomes even more critical when one tries to make a multi-generational claim. In one’s body, only a small fraction of cells (germ cells) can carry information to the children. Therefore, there has to be a mechanism for the ‘epigenetic stress due to histone modification’ to be transferred to germ cells in order to be transmitted to the next generation. I do not know of any such mechanism, given that histone methylation itself does not get carried from cell to cell.

(iii) Since I am bothered by such basic issues, I try to check the details of the experiments making bold claims related to epigenetics, and there I find another problem. Usually the experiments tend to be very data-poor. Typical genome-wide association study experiments are performed on thousands of individuals, and even after that the researchers claim that they have difficulty finding the genetic basis of complex diseases. In that context, using 20 or 30 mice to make a bold claim on ‘epigenetics’ is laughable.

That brings us to the broader question – is a stress-induced multi-generational effect ever seen anywhere in nature? That is where Dr. Hobert’s lab comes in. They have shown starvation-induced transgenerational inheritance in worms -

Starvation-Induced Transgenerational Inheritance of Small RNAs in C. elegans

The mechanism, however, is not methylation or histone acetylation, but something completely different (small RNAs). Nothing like this has been demonstrated in humans, and given the genetic distance between worms and humans, I would like to see real evidence before accepting that such a mechanism is at work in humans.

————————–

For our readers’ convenience, we reproduce the letter to Nature by Davidson, Hobert and Ptashne, because we had difficulty finding it for a while.

Questions over the scientific basis of epigenome project

We were astonished to see two sentences in your Editorial on the International Human Epigenome Consortium (Nature 463, 587; 2010) that seem to disregard principles of gene regulation and of evolutionary and developmental biology that have been established during the past 50 years.

You say: “By 2004, large-scale genome projects were already indicating that genome sequences, within and across species, were too similar to be able to explain the diversity of life. It was instead clear that epigenetics — those changes to gene expression caused by chemical modification of DNA and its associated proteins — could explain much about how these similar genetic codes are expressed uniquely in different cells, in different environmental conditions and at different times.”

With respect to ‘epigenomics’, we wish to stress that chromatin ‘marks’ and local chemical modifications of DNA, such as methylation, are the consequences of DNA-sequence-specific interactions of proteins (and RNA) that recruit modifying enzymes to specific targets. They are thus directly dependent on the genomic sequence. Such marks are the effects of sequence-specific regulatory interactions, not the causes of cell-type-specific gene expression. Gene regulation explains in outline, and in many cases in detail, how similar genomes give rise to different organisms. Self-maintaining loops of regulatory gene expression are switched on and off, producing the epigenetic effects that lie at the heart of development. Whereas the protein-coding toolkit is in part similar among species, the remainder of the genome, including almost all the regulatory sequence, is unique to each clade: therein lies the explanation for the diversity of animal life.

A letter signed by eight prominent scientists (not including us), and an associated petition signed by more than 50, went into these matters in greater detail, and expressed serious reservations about the scientific basis of the epigenome project. A modified version of the letter appeared in Science (H. D. Madhani et al. Science 322, 43–44; 2008) — the complete letter can be found at http://madhanilab.ucsf.edu/epigenomics.

Using Mathematical Induction to Design Algorithms

We are revisiting the 1980s, thanks to Gene Myers, and came across this elegant 1988 paper by Udi Manber that our readers may find helpful. Manber later wrote a book “Introduction to Algorithms – A Creative Approach”, where he expanded on the same ideas (more on that later).

This article presents a methodology, based on mathematical induction, for approaching the design and the teaching of combinatorial algorithms. While this methodology does not cover all possible ways of designing algorithms it does cover many known techniques. It also provides an elegant intuitive framework for explaining the design of algorithms in more depth. The heart of the methodology lies in an analogy between the intellectual process of proving mathematical theorems and that of designing combinatorial algorithms. We claim that although these two processes serve different purposes and achieve different types of results, they are more similar than it seems. This claim is established here by a series of examples of algorithms, each developed and explained by the use of the methodology. We believe that students can get more motivation, greater depth, and better understanding of algorithms by this methodology.

Mathematical induction is a very powerful proof technique. It usually works as follows. Let T be a theorem that we want to prove. Suppose that T includes a parameter n whose value can be any natural number. Instead of proving directly that T holds for all values of n we prove that (1) T holds for n = 1, and (2) T holds for any n > 1 provided that T holds for n – 1. The first part is usually very simple to prove. Proving the second part is easier in many cases than proving the theorem directly since we can use the assumption that T holds for n – 1. (In some sense we get this assumption for free.) In other words, it is enough to reduce the theorem to one with a smaller value of n, rather than proving it from scratch. We concentrate on this reduction. The same principle holds for algorithms. Induction allows one to concentrate on extending solutions of smaller subproblems to those of larger problems. One starts with an arbitrary instance of the problem at hand, and tries to solve it by using the assumption that the same problem but with smaller size has already been solved. For example, given a sequence of n > 1 numbers to sort (it is trivial to sort one number), we can assume that we already know how to sort n – 1 numbers. Then we can either sort the first n – 1 numbers and insert the nth number in its correct position (which leads to an algorithm called insertion sort), or start by putting the nth number in its final position and then sort the rest (which is called selection sort). We need only to address the operation on the nth number. (Of course, this is not the only way to sort, nor is it the only way to use induction for sorting.)
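
To make the inductive view concrete, here is a toy Python rendering (ours, not Manber’s) of the two sorts mentioned in the excerpt. In each case the work on n-1 numbers is delegated to the inductive hypothesis, and only the step involving the nth number is designed.

def insertion_sort(seq):
    # inductive step: sort the first n-1 numbers, then insert the nth into place
    if len(seq) <= 1:
        return list(seq)                 # base case: one number is trivially sorted
    rest = insertion_sort(seq[:-1])      # solved "for free" by the inductive hypothesis
    last = seq[-1]
    i = len(rest)
    while i > 0 and rest[i - 1] > last:  # find the correct position for the nth number
        i -= 1
    return rest[:i] + [last] + rest[i:]

def selection_sort(seq):
    # inductive step: put the largest number in its final position, then sort the rest
    if len(seq) <= 1:
        return list(seq)
    remaining = list(seq)
    remaining.remove(max(remaining))     # drop one occurrence of the maximum
    return selection_sort(remaining) + [max(seq)]

print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]
print(selection_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]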

Speaking of Manber’s book, here is a useful review by a reader on Amazon. We emphasized the most important part.

This book is much more than a catalog of algorithms (e.g., CLR): its purpose is to train your intuition to recognize mathematical structure in abstract problems. What does it matter if you know Dijkstra’s algorithm? It’s much more valuable to have good intuitions and an inductive reasoning tool chest with which to smash apart all of the variations of the shortest path problem (for example.)

The reviewers who wrote that the book “assumes you are a math wiz” and that it provides “little or no guidance for solving an arbitrary problem of the same type” didn’t get it. This book is trying very hard to make you into a wiz by forcing you to really interact with mathematics, rather than working through a set of nearly identical problems (–what passes for “education” in North America.)

I was just going to leave my review at that, but since the reviews that people find “helpful” are so way off base, I think I should throw in a relevant story.

When my friend was in grade 11, he showed up to the Canadian Computing Competition finals, placing 14th. The guy who won told him, “if you want to win, read this book.” Two years later, he won the CCC against 2000 other students. This book is the best introduction you can give a budding mathematician.

Sure, you can cough up what you’ve memorized from CLR during your university algorithms course. But, do you want to learn to invent algorithms yourself?

Math is not something handed down generations in big, dry textbooks. Like all knowledge, math is organically discovered Truth, and you have to learn to discover it for yourself.

Minia’s New Home – GATB (Genome Assembly & Analysis Tool Box)

The readers of our blog should be very familiar with parts of this project through our earlier introductions to the Minia assembler and the DSK k-mer counter. Minia uses the same idea as diginorm (Bloom filters), but builds an entire assembler with it. Now Rayan Chikhi, Guillaume Rizk, Dominique Lavenier and their collaborators have converted those programs into an entire library with useful modules.

Link

Motivation: Efficient and fast NGS algorithms are essential to analyze the terabytes of data generated by the next generation sequencing machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation.

Results: We propose an open-source library dedicated to genome assembly and analysis to fasten the process of developing efficient software. The library is based on a recent optimized de-Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with very low memory footprints.

Availability and Implementation: The GATB library is written in C++ and is available at the following web site http://gatb.inria.fr under the A-GPL license.

The New K-mer Counter and a Genome Assembly – (ii)

In the last commentary, we showed a few examples of doing k-mer counting with the new super-efficient program – kmc-2. This commentary will provide the context. Readers unfamiliar with how to interpret k-mer distributions of short reads are also encouraged to go through an old blog post – Maximizing Utility of Available RAMs in K-mer World.

Here is our story. We started assembling two sets of Illumina libraries discussed in the previous commentary using Minia and noticed something odd. At first, we assembled set2 (see the previous commentary for reference) at k=63 using Minia. The total size of all assembled contigs came to around 700Mb with N50=982. So far, so good.

Another bioinformatician helped us assemble set1 and set2 together with Minia. Since the depth of sequencing was very high, we chose to do this assembly at k=73. The total assembly size came to be roughly the same, but, quite puzzlingly, N50 dropped to 219 !

To figure out what was going on, we took set2 and did k-mer counting at k=63. The distribution came out as usual (unimodal, with a peak around 23, ignoring the noise peak near count=1). As explained in the earlier commentary, the location of the mode of the distribution is attributable to the depth of sequencing and the amount of repeats in the genome. The peak at 23 was somewhat low given how deeply this genome was sequenced (~50 was expected), but we also knew that this organism had extensive duplication. Therefore, there was nothing odd with the k-mer distribution as such.
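
As a reminder, the location of the main peak follows a simple back-of-envelope relation between base coverage and k-mer coverage; each read of length L contributes L-k+1 k-mers, so the peak drifts left as k grows. The numbers in the sketch below are hypothetical, not taken from this dataset.

def kmer_coverage(base_coverage, read_length, k):
    # expected k-mer coverage for base coverage C and reads of length L,
    # ignoring sequencing errors and repeats
    return base_coverage * (read_length - k + 1) / read_length

for k in (21, 33, 45, 63):
    print(k, kmer_coverage(50, 100, k))   # 40.0, 34.0, 28.0, 19.0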

—————————————————–

Armed with the new k-mer counter, we started to do forensic analysis on the libraries and here is what we found. The following three figures show k-mer distributions for k=21. Especially pay attention to the third chart below.

first set, k=21

set21-200

second set, k=21

set21-300

both sets, k=21

set21-both

The above chart, showing the k-mer distribution of the combined libraries, has two distinct peaks: one at the right location (~120), derived from the coverage, and the other at half of that number. The above pattern very likely indicates a high degree of heterozygosity in our sample. The two chromosomes in a diploid genome are usually very similar, and therefore the total k-mer coverage is unimodal. When the difference between the chromosomes starts to increase, a second peak appears at a location that is half of the primary peak.

We first thought that the two PE libraries (set1 and set2) were prepared with samples from two different animals, but our collaborator confirmed that they were from the same animal. In fact, the k-mer distribution from each set also shows two separate peaks, even though those peaks are somewhat less pronounced than when the sets are combined together.

————————————————————–

When we plotted the distribution at increasing k-values, quite an interesting picture emerged. First check the following figures and then see the bottom of the post for interpretation.

first set, k=27

set27-200

second set, k=27

set27-300

both sets, k=27

set27-both

————————————————————–

first set, k=33

set33-200

second set, k=33

set33-300

both sets, k=33

set33-both

————————————————————–

first set, k=39

set39-200

second set, k=39

set39-300

both sets, k=39

set39-both

————————————————————–

first set, k=45

set45-200

second set, k=45

set45-300

both sets, k=45

set45-both

————————————————————–

first set, k=51
set51-200

second set, k=51
set51-300

————————————————————–

both sets, k=63

set63-both

————————————————————–

What is going on? We suspect the differences between the two chromosomes come in many short patches and are present all over the genome. Also, we can guess that their relative distances are around 40 (27 + 27/2). So, when we take the k-mer distribution at k=21, a large number of k-mers from the two chromosomes appear to be identical, thus making the peak at the higher coverage taller than the peak at the lower coverage. As k is increased from 21 to 27, 33, 39, 45, 51 and finally 63, the two chromosomes start to differ everywhere, leaving very few identical k-mers from both copies of the chromosomes. Therefore, the peak at the higher value disappears at k=63.

That would certainly be very odd and these charts will keep us awake all night.
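
To see whether the ‘short patches spaced roughly 40 bases apart’ guess is even qualitatively consistent with the charts, here is a small Python sketch with two toy models of how the shared (homozygous) fraction of k-mers should shrink as k grows. Both are illustrations under an assumed spacing, not fits to our data.

import math

def frac_shared_regular(k, d):
    # differences exactly every d bases: a k-mer avoids them only if it fits in a gap
    return max(0.0, (d - k) / d)

def frac_shared_poisson(k, d):
    # differences scattered at random with density 1/d: survival of a k-base window
    return math.exp(-k / d)

for k in (21, 27, 33, 39, 45, 51, 63):
    print(k, round(frac_shared_regular(k, 40), 2), round(frac_shared_poisson(k, 40), 2))

Under either model, the fraction of k-mers shared by the two haplotypes drops steeply between k=21 and k=63, which would make the higher-coverage peak fade in much the way the charts above show.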

Other interpretations are welcome! Also, please let me know if the descriptions provided above of what we are doing are not sufficient.

The New K-mer Counter and a Genome Assembly – (i)

Now that Sebastian Deorowicz et al. have published a damn good k-mer counter, let us put it to some real work. The first half of this two-part series will present some benchmarks of the program, and the second half will discuss our use of it on a real-life genome assembly problem.

Installation

The installation of kmc 2 is fairly straightforward. We went to their website, downloaded the source code, compiled and installed the program. The compilation needs the Boost library, and if you have it properly installed, everything else should go smoothly and give you two programs – kmc and kmc_dump. They also have pre-compiled binaries, but we downloaded the source code to be able to poke around in the near future.

Description of our project

We have two sets of Illumina PE reads from a vertebrate species that swims. The first set (set1) has 1,588,478,432 100nt reads and the second set (set2) has 1,376,496,476 100nt reads. We did k-mer counting in set1, set2 and set1+set2 for k=21, 27, 33, 39 and 45. We also did k-mer counting for set1 at k=51, set2 at k=51 and set1+set2 at k=63. Yes, that is a lot of k-mer counting.

This was done on a 32-core AMD server with 64-bit Linux, a non-SSD hard drive and 512 GB of RAM. The amount of RAM is somewhat irrelevant for this highly efficient program.

Time

I started the k-mer counting tasks around midnight and got all results back by ~7AM. Wow! They were run serially, while the server was also running twenty blastp jobs from another user.

set1 21mer counting took 585.523s.
set2 21mer counting took 567.74s.
set1+set2 21mer counting took 1507.44s.

set1 27mer counting took 612.069s.
set2 27mer counting took 534.832s.
set1+set2 27mer counting took 1361.61s.

set1 33mer counting took 1197.23s.
set2 33mer counting took 1037.95s.
set1+set2 33mer counting took 2440.34s.

We noticed a jump in time needed for k>32, but no big differences after that for k=33, 39, 45 and 51.

Here is the output of the first k-mer counting job (set1 at k=21), followed by the stage timings for set2, so that you have an idea of the temporary space requirements.

1st stage: 258.373s
2nd stage: 327.15s
Total : 585.523s
Tmp size : 129830MB

Stats:
No. of k-mers below min. threshold : 3342104765
No. of k-mers above max. threshold : 0
No. of unique k-mers : 4734755445
No. of unique counted k-mers : 1392650680
Total no. of k-mers : 126120517209
Total no. of reads : 1588478432
Total no. of super-k-mers : 15387969776
1st stage: 252.294s
2nd stage: 315.446s
Total : 567.74s
Tmp size : 112898MB
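
As a quick sanity check on the set1 numbers above (assuming all reads are exactly 100 nt, which is only approximately true):

reads = 1_588_478_432
read_length, k = 100, 21

print(reads * (read_length - k + 1))   # 127078274560, about 0.8% more than the
                                       # reported 126120517209, consistent with some
                                       # reads being shorter or containing Ns

print(4_734_755_445 - 3_342_104_765)   # 1392650680, matching the number of
                                       # "unique counted" k-mers exactly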

Command to Run and Output Files

The command kmc can be run either on a single fasta/fastq file or multiple files. The first two commands below are for single files and the third one is for a file named ‘list’, which lists both fasta files.

./kmc -k21 -m200 -fa s200-data.fa out21-200 tmp
./kmc -k21 -m200 -fa s300-data.fa out21-300 tmp
./kmc -k21 -m200 -fa @list out21-both tmp

The program kmc produces two binary output files (out21-200.kmc_pre – 1.1G and out21-200.kmc_suf – 5.2G). Interestingly, in all the k-mer counting examples we ran, the size of the ‘pre’ file varied between 67,174,492 and 1,073,807,452 bytes, and we do not have much of a clue why. This is where having the code will come in handy. One clue – the file size appears to be k-dependent.

The program kmc_dump takes those two binary files, and gives you the k-mer counts in two-column text format. Command to run -

./kmc_dump -cx0 out21-200 o
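
For example, a few lines of Python (our helper, not part of KMC) are enough to turn that two-column output into the count histograms shown in part (ii) -

from collections import Counter

def histogram(dump_path):
    hist = Counter()
    with open(dump_path) as fh:
        for line in fh:
            kmer, count = line.split()
            hist[int(count)] += 1      # number of distinct k-mers seen 'count' times
    return hist

hist = histogram("o")                  # "o" is the output file of the command above
for coverage in sorted(hist):
    print(coverage, hist[coverage])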

If the above commentary appears rather drab, the next commentary will surely add some color to it.

After SAM and BAM Comes ADAM !

A Berkeley group is proposing a new data format.

Current genomics data formats and processing pipelines are not designed to scale well to large datasets. The current Sequence/Binary Alignment/Map (SAM/BAM) formats were intended for single node processing. There have been attempts to adapt BAM to distributed computing environments, but they see limited scalability past eight nodes. Additionally, due to the lack of an explicit data schema, there are well known incompatibilities between libraries that implement SAM/BAM/Variant Call Format (VCF) data access. To address these problems, we introduce ADAM, a set of formats, APIs, and processing stage implementations for genomic data. ADAM is fully open source under the Apache 2 license, and is implemented on top of Avro and Parquet for data storage. Our reference pipeline is implemented on top of Spark, a high performance in-memory map-reduce system. This combination provides the following advantages: 1) Avro provides explicit data schema access in C/C++/C#, Java/Scala, Python, php, and Ruby; 2) Parquet allows access by database systems like Impala and Shark; and 3) Spark improves performance through in-memory caching and reducing disk I/O. In addition to improving the format’s cross-platform portability, these changes lead to significant performance improvements. On a single node, we are able to speedup sort and duplicate marking by 2×. More importantly, on a 250 Gigabyte (GB) high (60×) coverage human genome, this system achieves a 50× speedup on a 100 node computing cluster (see Table 1), fulfilling the promise of scalability of ADAM. The ADAM format provides explicit schemas for read and reference oriented (pileup) sequence data, variants, and genotypes. As the schemas are implemented in Apache Avro—a cross-platform/language serialization format—they eliminate the need for the development of language-specific libraries for format decoding/encoding, which eliminates the possibility of library incompatibilities. A key feature of ADAM is that any application that implements the ADAM schema is compatible with ADAM. This is important, as it prevents applications from being locked into a specific tool or pattern. The ADAM stack is inspired by the “narrow waist” of the Internet Protocol (IP) suite (see Figure 2). We consider the explicit use of a schema in this format to be the greatest contribution of the ADAM stack. In addition to the advantages outlined above, ADAM eliminates the file headers in modern genomics formats. All header information is available inside of each individual record. The variant and genotype formats also demonstrate two significant improvements. First, these formats are co-designed so that variant data can be seamlessly calculated from a given collection of sample genotypes. Secondly, these formats are designed to flexibly accommodate annotations without cluttering the core variant/genotype schema. In addition to the benefits described above, ADAM files are up to 25% smaller on disk than compressed BAM files without losing any information. The ADAM processing pipeline uses Spark as a compute engine and Parquet for data access. Spark is an in-memory MapReduce framework which minimizes I/O accesses. We chose Parquet for data storage as it is an open-source columnar store that is designed for distribution across multiple computers with high compression. 
Additionally, Parquet supports efficient methods (predicates and projections) for accessing only a specific segment or fields of a file, which can provide significant (2-10×) additional speedup for genomics data access patterns.
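
To see what ‘predicates and projections’ buy in practice, here is a hedged sketch using the generic pyarrow library rather than ADAM’s own API; the file name and column names below are hypothetical.

import pyarrow.parquet as pq

# projection: read only the columns we need; predicate: skip row groups that
# cannot contain records from the region of interest
table = pq.read_table(
    "reads.adam.parquet",
    columns=["referenceName", "start", "sequence"],
    filters=[("referenceName", "=", "chr6")],
)
print(table.num_rows)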

Minimizer Revolution Continues with KMC-2 K-mer Counter

A paper like this had been quite expected from either the KMC or the DSK group, based on the success of minimizers so far (check – “2014 – The Year of Minimizers in Bioinformatics”). The issue is how to choose the minimizer, because alphabetical ordering creates a number of problems. Firstly, the sizes of the partitions are unbalanced, and low-entropy sequences like AAAAAA get too many hits. Rayan Chikhi et al. showed that choosing minimizers based on their relative frequency works better, but that requires knowing all k-mers, whereas in k-mer counting one is trying to find the k-mers in the first place. Wood and Salzberg came up with another clever approach based on randomizing the bit ordering using XOR.
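
A toy Python sketch may make the ordering issue concrete: it picks the minimizer of a single k-mer under plain alphabetical ordering and under an XOR-scrambled ordering in the spirit of Wood and Salzberg’s trick. It is an illustration only, not KMC 2’s actual signature scheme (the quoted passage below describes that).

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(mmer):
    # pack an m-mer into an integer, 2 bits per base
    v = 0
    for base in mmer:
        v = (v << 2) | ENC[base]
    return v

def mmers(kmer, m):
    return [kmer[i:i + m] for i in range(len(kmer) - m + 1)]

def lexicographic_minimizer(kmer, m):
    return min(mmers(kmer, m))

def xor_minimizer(kmer, m, mask=0x2A5B):
    # compare m-mers after XOR-ing their encodings with a fixed mask
    # (the mask value here is arbitrary)
    return min(mmers(kmer, m), key=lambda s: encode(s) ^ mask)

kmer = "AAAAACGTACGTTTGCA"
print(lexicographic_minimizer(kmer, 7))  # alphabetical ordering favours the A-rich m-mer
print(xor_minimizer(kmer, 7))            # the scrambled ordering typically picks another one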

This paper proposes two novel ideas -

There are two main ideas behind these improvements. The first is the use of signatures of k-mers that are a generalization of the idea of minimizers (Roberts et al., 2004a,b). Signatures allow significant reduction of temporary disk space. The minimizers were used for the first time for the k-mer counting in MSPKmerCounter, but our modification significantly reduces the main memory requirements (up to 3–5 times) as well as disk space (about 5 times) as compared to MSPKmerCounter. The second main novelty is the use of (k, x)-mers (x > 0) for reduction of the amount of data to sort. Simply speaking, instead of sorting some amount of k-mers we sort a much smaller portion of (k + x)-mers and then obtain the statistics for k-mers in the postprocessing phase.

We are still in the middle of going through the paper, but we presume the second idea may have come from their earlier paper on compression, mentioned in our earlier blog post – Minimizer Success – Disk-based Genome Sequencing Data Compression. In the meanwhile, please feel free to comment on -

KMC 2: Fast and resource-frugal k-mer counting

Motivation: Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.

Results: We present a novel method for k-mer counting, on large datasets at least twice faster than the strongest competitors (Jellyfish 2, KMC 1), using about 12 GB (or less) of RAM memory. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k,x)-mers allows to significantly reduce the I/O, and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC 2 allows to count the 28-mers of a human reads collection with 44-fold coverage (106 GB of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with an SSD.

And the world cup goes to Gliwice and Lodz, Poland !

The presented KMC 2 algorithm is currently the fastest k-mer counter, with modest resource (memory and disk) requirements. Although the used approach is similar to the one from MSPKmerCounter, we obtain an order of magnitude faster processing, due to the following KMC features: replacing the original minimizers with signatures (a carefully selected subset of all minimizers), using (k, x)-mers and a highly parallel overall architecture. As opposed to most competitors, KMC 2 worked stably across a large range of datasets and test settings.

Kmers Rock – Improved Genome Inference in the MHC using a Population Reference Graph

This new paper appeared on bioRxiv (h/t: Jared Simpson). As always, the supplementary section has the most meat. The authors love k-mers so much that they came up with new English words (kMerification, kMerify) and a new way to write it (kMer).

In humans and many other species, while much is known about the extent and structure of genetic variation, such information is typically not used in assembling novel genomes. Rather, a single reference is used against which to map reads, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences as well as catalogues of SNPs and short indels. The genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov Model, allowing for recombination between different haplotypes and variants. By applying the method to the 4.5Mb extended MHC region on chromosome 6, combining eight assembled haplotypes, sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, short-read and long-read data, how the method improves the accuracy of genome inference. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference-quality genome sequences.
