An Easy-to-follow Introductory Book on NGS Assembly Algorithms

Dear Readers,
Continue reading An Easy-to-follow Introductory Book on NGS Assembly Algorithms

‘Best of 2014′ – My Personal Opinion

Last year, I assembled a panel of judges, who chose the best bioinformatics contributions of the year from a number of suggestions made by the readers. You can see the results in Announcing the Results of ‘Best of 2013′.

The entire process took quite a bit of time of everyone involved. Moreover, we had an early start to make the effort visible and get the most feedback from readers. This year I got busy and did not find enough time to notify the judges and the readers. So, instead of having the ‘Best of 2014′ in the same way as last year, I will share the ideas I found most interesting and encourage you to do the same in comment section.

1. Succint de Bruijn Graph

Variable-Order de Bruijn Graphs

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Very recently, two papers came out implementing the succint de Bruijn graph idea of Alex Bowe et al. I found both of them quite interesting. The idea of succint de Bruijn graph merges two concepts – (i) de Bruijn graphs for assembly, (ii) XBW transform for graphs with its associated rank and select structure. The paper on variable order de Bruijn graph showed how succint de Bruijn graph could be used to naturally progress to de Bruijn graphs of multiple kmer sizes and I found that quite elegant.

2. Minimizer

On the representation of de Bruijn Graph

KMC 2: Fast and resource-frugal k-mer counting

Chikhi et al’s merger of assembly and minimizer was quite elegant. The aspects of KMC2 paper I found most interesting were – (i) treatment of homopolymer minimizers to adjust bucket sizes, (ii) highly parallel code.

3. Cache-efficient common kmer finding in DALIGNER

DALIGNER: Fast and sensitive detection of all pairwise local alignments

In DALIGNER paper, Gene Myers came up with a cache-efficient method for finding common k-mers from two reads, and the method scaled almost linearly with the number of cores. I found that quite elegant. Also, his method to use Pacbio stats to cut down unlikely paths in O(ND) alignment to make it useful for long reads was very good.


The above is only a subset of a large number of unusual contributions made by bioinformatics researchers. A bigger list of interesting papers on NGS can be found in ASHG/GA4GH Special – A Recap of Interesting NGS Algorithms of the Year. Please share in comment section any idea, paper, blog or teaching tool that you enjoyed.

Dear BGI, Please Stop ‘Revealing’ Any More ‘Insights’ :)

Long long time ago, titles of papers told you exactly what was inside. For example, Alexander Fleming’s Penicillin paper (1928) was titled – On the Antibacterial Action of Cultures of a Penicillium, with Special Reference to Their Use in the Isolation of B. influenzae, not “here is the greatest drug to save humanity”.

The practice continued until late 1990s, when the first few genome papers were named as banally as “Analysis of the genome sequence of the flowering plant Arabidopsis thaliana” or “The nucleotide sequence of Saccharomyces cerevisiae chromosome XV and its evolutionary implications.” Then, somewhere along the line, someone got the idea that the genome paper needed to be titled colorfully, and have to use the word ‘insight’ and other primates writing genome papers copied it. For a full list of insight-full genome papers, check –

Shocking Finding that a Genome by Itself Provides Little Insight


Clearly, the purpose of genome sequencing is to “provide insights”.

On the other hand, ‘reveal’ is the common word to title the press releases of genome papers and the practice likely started with Watson’s 2001 write-up –

Watson, J.D. 2001. The human genome revealed. Genome Research 11: 1803-1804.

Fast forward by thirtee years and we have BGI with their mastery of words and revealing insights. Here are the titles of latest BGI papers –

Comparative genomics reveals insights into avian genome evolution and adaptation

Two Antarctic penguin genomes reveal insights into their evolutionary history and molecular changes related to the Antarctic environment

Oh well ! Where do we go from here?

Alternative Splicing – the New Snake Oil to Explain ‘Human Complexity’

We have been noticing increasingly that various bioinformaticians are chasing alternative splicing in RNAseq data. That qualifies as harmless curiosity just by itself, but when one combines it with additional observation that various ENCODE and GTEx clowns are using alternative splicing to ‘explain’ ‘human complexity’, the situation gets worrisome. On the later point, look no further than a paper published by leading ENCODE clown Mike Snyder last year (covered in Human Transcriptome is Extremely Complex and Snyderome is the Most Complex of All).

The problem started in 2001, when the number of protein-coding genes in human genome fell far short of the expected 100,000 to only 25,000. That was comparable to 20,000 genes in measly worm. The observation deflated the ego of a few bad scientists, who had been looking for other explanations of human complexity ever since. Using alternative splicing to turn that 20,000 to 100,000 splice forms is one method, others being declaring the entire genome as ‘biochemically functional’ and talking incessantly about human long non-coding RNA as something marvelous. Sandwalk blog has a full list.

Here’s the latest list of the sorts of things that may salvage your ego if it has been deflated.

1. Alternative Splicing: We may not have many more genes than a fruit fly but our genes can be rearranged in many different ways and this accounts for why we are much more complex. We have only 25,000 genes but through the magic of alternative splicing we can make 100,000 different proteins. That makes us almost ten times more complex than a fruit fly. (Assuming they don’t do alternative splicing.)

2. Small RNAs: Scientists have miscalculated the number of genes by focusing only on protein encoding genes. Our genome actually contains tens of thousands of genes for small regulatory RNAs. These small RNA molecules combine in very complex ways to control the expression of the more traditional genes. This extra layer of complexity, not found in simple organisms, is what explains the Deflated Ego Problem.

3. Pseudogenes: The human genome contains thousands of apparently inactive genes called pseudogenes. Many of these genes are not extinct genes, as is commonly believed. Instead, they are genes-in-waiting. The complexity of humans is explained by invoking ways of tapping into this reserve to create new genes very quickly.

4. Transposons: The human genome is full of transposons but most scientists ignore them and don’t count them in the number of genes. However, transposons are constantly jumping around in the genome and when they land next to a gene they can change it or cause it to be expressed differently. This vast pool of transposons makes our genome much more complicated than that of the simple species. This genome complexity is what’s responsible for making humans more complex.

5. Regulatory Sequences: The human genome is huge compared to those of the simple species. All this extra DNA is due to increases in the number of regulatory sequences that control gene expression. We don’t have many more protein-encoding regions but we have a much more complex system of regulating the expression of proteins. Thus, the fact that we are more complex than a fruit fly is not due to more genes but to more complex systems of regulation.

6. The Unspecified Anti-Junk Argument: We don’t know exactly how to explain the Deflated Ego Problem but it must have something to do with so-called “junk” DNA. There’s more and more evidence that junk DNA has a function. It’s almost certain that there’s something hidden in the extra-genic DNA that will explain our complexity. We’ll find it eventually.

7. Post-translational Modification: Proteins can be extensively modified in various ways after they are synthesized. The modifications, such as phosphorylation, glycosylation, editing, etc., give rise to variants with different functions. In this way, the 25,000 primary protein products can actually be modified to make a set of enzymes with several hundred thousand different functions. That explains why we are so much more complicated than worms even though we have similar numbers of genes.

What, then, is the explanation of ‘human complexity’? The answer is none. Humans are no more complex than most other animals.

Getting back to original point, I would be very cautious about reading too much from alternative splicing data in RNAseq, unless backed by other biochemical confirmations.

Taiwanese Nanotech Paper Heading for Retraction, Professor/Student Punished

When we wrote about a controversial Nanopore paper last year (see Nanopore Sequencing by Divine Intervention?), we expected the authors to either receive multiple Nobel prizes or get elevated to the status of deity. The second option seemed more likely after additional details came out about the project (see Bizarro Details Coming out on ‘Nanopore Sequencing by Divine Intervention’ Story). Apparently the ‘lab’, where those cutting-edge experiments were done, did not have any suitable equipment and was located far from the university near a well-known temple.

National Chiao Tung University, whose motto is 知新致遠 崇實篤行 (“Learn New Knowledge and Reach Far; Honor the Truth and Work Hard”), did not appreciate being a laughing stock any more and pulled the plug on this project. Those, who understand Chinese, can check about this latest development at this link. The jumbled up text from google translate text is not worth reproducing and the picture makes one think the professor/student got some big reward, not kicked out of the university.


This is a terrible tragedy for our blog, because we will have to restrict ourselves with making fun of domestic clowns, such as the ENCODE leaders and the positivity lady.

Excellent Software Engineering in KMC

I did not have much respect for ‘Boosted’ C++ programmers in bioinformatics until I went through KMC code. KMC, or rather its latest version KMC2, is currently the fastest (and leanest) kmer counting program. A kmer counting program is conceptually very naive, and that means the faster programs are reflections of programmers’ knowledge about various hardware and data structure concepts. For example, Jellyfish uses compare-and-swap feature of modern processors to make hash table updates near concurrent. BFcounter and a number of other programs use Bloom filter, a data structure concept unrelated to bioinformatics.

KMC, the earlier program, was disk-based and memory-light just like Chikhi et al.’s DSK. However, KMC showed significant speed improvement over DSK by using extensive parallelization in various steps. KMC2 has further improvement over KMC due to its use of minimizer concept. In fact, KMC2 is also an improvement over MSPcounter, another minimizer-based kmer counting program, because it modifies minimizers to ‘signatures’ to avoid too many subreads going to the same bucket.

Those algorithmic contributions aside, KMC2 code is a pleasure to read. The code is organized into four subdirectories (kmer_counter, kmc_dump, kmc_dump_sample, kmc_api), among which ‘kmer_counter’ is most important. The following figure from the kmc2 paper displays the flow diagram of the code.


KMC code is multithreaded and uses a number of queues to handle parallelization. What I found quite amazing is that it completed kmer counting in about the same time another test code written by me barely went through a giant FASTQ file. That means the biggest bottleneck of KMC2 is in reading fastq file from the disk and not in actual processing and counting of kmers. I am working through various classes within the code, and as I proceed, the following blocks will be updated. If you have any further insight, please feel free to let me know in the comment section below.

kmer_counter.cpp: The main code is in kmer_counter.cpp. It defines the class ‘CApplication’ that calls various classes through CKMC class.

kmc.h – defines the CKMC class that parallelizes the code and calls the classes in the files below.

CKMC class is where all action takes place. So, let me describe it in a bit more detail.

i) The public function SetParams(CKMCParams &_Params) sets all parameters.

ii) The public function Process() does all work by synchronizing all threads, etc. This is the most important function in the entire program.

iii) Here is a complete list of all other private and public functions. The names are quite descriptive of what they do.

template class CKMC {
bool initialized;

CStopWatch w0, heuristic_time , w1, w2;

// Parameters (input and internal)
CKMCParams Params;

// Memory monitor and queues
CKMCQueues Queues;

// Thread groups
vector gr0_1, gr0_2;
vector gr1_1, gr1_2, gr1_3, gr1_4, gr1_5; // thread groups for 1st stage
vector gr2_1, gr2_2, gr2_3; // thread groups for 2nd stage

uint64 n_unique, n_cutoff_min, n_cutoff_max, n_total, n_reads, tmp_size, n_total_super_kmers;

// Threads
vector w_stats_fastqs;
vector*> w_stats_splitters;
vector w_fastqs;
vector*> w_splitters;
CWKmerBinStorer *w_storer;

CWKmerBinReader* w_reader;
vector*> w_sorters;
CWKmerBinCompleter *w_completer;

void SetThreads1Stage();
void SetThreads2Stage(vector& sorted_sizes);

bool AdjustMemoryLimits();
void AdjustMemoryLimitsStage2();

void ShowSettingsStage1();
void ShowSettingsStage2();


void SetParams(CKMCParams &_Params);
bool Process();
void GetStats(double &time1, double &time2, uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total, uint64 &_n_reads, uint64 &_tmp_size, uint64& _n_total_super_kmers);

kmer.h – defines classes Ckmer and CKmerQuake for handling Kmers without and with quality scores.

fastq_reader.h – FASTQ reader classes CFastqReader and CWStatsFastqReader. The second class is a wrapper for the first one created to perform multi-threading.

splitter.h – defines the class CSplitter that splits reads into variuous bins. This is where minimizers are computed.

kb_storer.h – defines the class CKmerBinStorer that stores bins of kmers.

kb_sorter.h – defined the class CKmerBinSorter which sorts kmers within bins.

kb_reader.h – defines the class CKmerBinReader, which reads bins from distribution phase.

kb_completer.h – defines the class CKmerBinCompleter, which complete the sorted bins and store in a file.

kb_collecter.h – defines the class CKmerBinCollector, which collects kmers belonging to a single bin.

The Tragic Demise of BLAST

BLAST is one of the most well-known bioinformatics programs. In fact, if I remove ‘one of’ from the previous sentence, that will not be a mistake. The scientists developing it received numerous well-deserved awards. For early history of how the program was written, please check the blog post – Once upon a BLAST .

Only the successful hypotheses and the successful experiments are mentioned in the text — a small minority — and the painful intellectual labor behind discoveries is omitted altogether. Time is precious, and who wants to read endless failure stories? Point well taken. But this unspoken academic pact has sealed what I call the curse of research. In simple words, the curse is that by putting all the emphasis on the results, researchers become blind to the research process because they never discuss it. How to carry out good research? How to discover things? These are the questions that nobody raises (well, almost nobody).


But that is all history from early 1990s. What most researchers do not know is that BLAST is dying at present in the hand of NCBI. It is not dying due to lack of funding, but from over-funding. Early innovators have all left and God knows who has taken over.

Last week, I downloaded the source codes of several bioinformatics programs we use frequently – BLAST, BWA, samtools, SOAPdenovo2, Minia, Trinity, KMC2, SPAdes, DAZZLER, etc. – and compiled each of them. NCBI BLAST turned out to be THE worst among all of them.

Here are the specific complaints –

(i) BLAST compiled to executable directory (ReleaseMT) of size ~890MB ! Why would an aligner need almost 1GB of space? For comparison, bwa compiles to only 1.4MB.

(ii) Configure and make of BLAST took over one hour in a reasonably high-end server. BWA compiled in exactly 12 seconds.

(iii) The BLAST executable directory seemed to have all kinds of ‘sophisticated’ modules, including unit tests and other junk. That is all fine and dandy, except (see next point) –

(iii) After one hour of waiting and getting my directory filled, I desperately looked for a ‘make clean’, but could not find such option. Every other program had ‘make clean’ to help me get back to the ground state. Why not BLAST?

If BLAST gets any more funding, it will turn into Obamacare website software and stop working altogether. NCBI should urgently place BLAST into github, start derivative projects (aka wasting money) under new names. BLAST does not need any further development and definitely no more funding to destroy it.

New Nanopore Paper by Phil Ashton

Readers of our blog are familiar with Philip Ashton’s excellent work (see here and here). He recently published a new paper on nanopore sequencing that you may enjoy. In twitter, it is being presented as the ‘first serious analysis of Nanopore data’.

MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island

Short-read, high-throughput sequencing technology cannot identify the chromosomal position of repetitive insertion sequences that typically flank horizontally acquired genes such as bacterial virulence genes and antibiotic resistance genes. The MinION nanopore sequencer can produce long sequencing reads on a device similar in size to a USB memory stick. Here we apply a MinION sequencer to resolve the structure and chromosomal insertion site of a composite antibiotic resistance island in Salmonella Typhi Haplotype 58. Nanopore sequencing data from a single 18-h run was used to create a scaffold for an assembly generated from short-read Illumina data. Our results demonstrate the potential of the MinION device in clinical laboratories to fully characterize the epidemic spread of bacterial pathogens.

The paper has two interleaved stories – one on Oxford nanopore technology and one on genome modification in bacteria leading to antibiotic resistance. We will cover the MinION part here and elaborate on the biological and medical aspects of antibiotic-resistant bacteria in a later blog post.

Speaking of nanopore, the most important contribution of the paper is its estimation of error profile. We will mention several key observations in the following sections.

1. Use only 2D reads in the future

MinION generates three types of reads – template, complement and 2D. Among them, 2D reads use data from two measurements of the same read to reduce the error rate. Based on our understanding of the paper, it will be safe to ignore the first two categories due to low accuracy and use only the 2D reads.

2. Accuracy is 95%, 85%, 10%, 72%

When it comes to the accuracy of Oxford nanopore reads, we have seen all kinds of numbers in the past, ranging from Mikheyev’s 10% to 95%. If this controversy appears too similar to ENCODE’s estimate of functionality of human genome, that is not a coincidence, given that Ewan Birney advises both organizations.

The authors have spent good deal of time on this issue and arrived at ~72% for 2D reads. Interestingly, instrument’s own Phred estimates for the same reads are too high at 84.6% and should be revised. What does not make sense is why accuracy improves to only 72% for 2D from ~62% for ‘1D’ reads. A simple probabilistic calculation would have given 1-(.38^2) or 85% accuracy for 2D. Our naive explanation is that a large fraction of ‘1D’ reads does not get mapped, whereas the mappability improves after 2D correction.

3. Use LAST aligner, not BLAST or BWAMEM

Authors reported LAST aligner working well for ONT reads, but not BLAST or BWAMEM aligners. Why BWAMEM did not work is not clear to us given that it has a setting of ‘-x ont2d’ pegging accuracy at around 65% (corrected, see below). Have no fear – all alignment-related matters are being looked into by Ewan Birney, whose PhD thesis was on ‘Sequence alignment in bioinformatics’. We will update, as soon as we hear from him.

Heng Li mentioned that he added ‘-x ont2d’ option in bwa-0.7.11, whereas it was not available in bwa-0.7.10 used by Ashton.

With ‘-x ont2d’, the BWAMEM alignements are as good as LAST for most cases (with even identical alignment scores) except for a tiny fraction of reads.

4. Detailed error profile

The section on detailed error profile is quite informative. Indels are the most common errors just like in Pacbio, and it appears that ONT has difficulty differentiating Gs and Cs in the homopolymer regions. The entire section is presented below.

The error profile of the MinION H125160566 data set was characterized. In the 70.8 megabases of aligned nanopore sequence an indel occurred every 5.9 bases on average. Two-thirds of these indels were deletions (8.6 Mbp), whereas one-third were insertions (4.15 Mbp). The mean deletion length was 1.7 bp, whereas the mean insertion length was 1.6 bp. Both insertion and deletion lengths had a negative exponential distribution (Supplementary Figs. 3 and 4).

A z-score was calculated to identify k-mers that were deleted from mapped MinION reads at a higher-than-expected frequency. When z-scores for deletions of 3–6 bp were binned by GC content, the tendency of the MinION to skip k-mers containing As and Ts only or Gs and Cs only could be seen (Fig. 1). In particular, there were two 3-mers with a z-score > 3, GGG (z = 3.3) and CCC (z = 3.2), two 4-mers with a z-score > 3, GGGG (z = 4.4) and CCCC (z = 4), and four 5-mers with a z-score > 4, AAAAA, TTTTT, CCCCC and GGGGG. Two more diverse k-mers that had high z-scores were TAGGCA and TAGGGC, with z-scores of 9.6 and 9.5, respectively. When insertions were examined there were two 3-mers with a z-score > 3, CCC (z = 3.7) and GGG (z = 3.6), and two 4-mers with a z-score > 3, CCCC (z = 6.1) and GGGG (z = 6.0).

To identify substitutions, all nonconsensus bases were extracted from the MinION alignment to the Illumina assembly. This showed that A to T and T to A substitutions were approximately half as frequent as other substitutions (Supplementary Table 2). These data suggest that MinION currently has difficulty differentiating Gs and Cs, particularly when these bases are present in homopolymeric tracts.

5. Lex Nederbragt?

Lex Nederbragt became untouchable by mocking the company telling the truth in his blog and did not get an early access to MinION. We tried our best in our petition to request early access to Minion nanopore for Lex Nederbragt’s group that was not approved by the company. Therefore, it was encouraging to see him contributing to the ‘first serious Nanopore paper’. From the acknowledgement section –

We would also like to thank L. Nederbragt for a thorough review and contributions toward the presentation of Figure 1.

6. Biological discovery – was nanopore sequencing useful?

Speaking of biological discovery, it was not clear how much of authors’ finding of the position and structure of a bacterial antibiotic resistance island came from technological help from Nanopore sequencing and how much came from their own ingenuity. They needed Illumina sequencing to get their work done and used nanopore reads to only put the pieces together. The biggest contribution of nanopore is possibly this one –

The assembly of the Illumina data collapsed these two genes into a single copy, which represented a misassembly that confounded analysis and was only resolved using the MinION reads.

Another question – how can one resolve the assembly failures such as hyp and merA mentioned below?

To demonstrate the utility of MinION sequence for hybrid genome assembly, the Illumina and Minion data for the H125160566 strain were assembled using SPAdes18. This method resolved the genome into 34 contigs, with an N50 of 319 kbp, whereas the Illumina-only assembly produced 86 contigs with an N50 of 154 kbp. The SPAdes hybrid assembly confirmed the yidA insertion site but failed to resolve the complete island structure with contig breaks in the hyp and merA genes, presumably due to the low coverage regions in the Illumina data. SPAdes hybrid assembly demonstrated significant improvement over Illumina-only assembly—further improvements would be possible given higher-coverage MinION data.

As we see it, the failure was not due to short or long reads, but rather from GC coverage bias of Illumina Nextera sequencing. So, the only way to solve the problem is to go to some bias-free sequencing technology, unless the lab is ready to do PCR and Sanger sequencing on every assembled genomic segment that is suspect.

Overall, this paper presents the current status of Oxford Nanopore sequencing well, and will be valuable to bioinformaticians for its detailed profiling of error distribution.



Nick Loman posted a comparison table (columns – # aligned, % aligned, mean alignment length) of aligning nanopore R7.3 chemistry to E. coli reference. It shows no difference between ‘-x ont2d’ and ‘-x pacbio’ alignment settings, suggesting that the error rates of Pacbio and ONT2D are similar for this new chemistry. Please note that Ashton et al. likely used an earlier chemistry.


Gene Myers Needs Postdoc, Gets Endorsed by Legendary Evolutionary Biologist Ewan Birney

Those looking for post-doc opening in bioinformatics should not miss the following opportunity at Gene Myers’ lab.

Dazzling Postdoc?

I am looking for a postdoc to work directly with me, Gene Myers, in Dresden, Germany on algorithms and software for a next-next-gen assembler for long read sequencers. The required attributes are:

Training and expertise in the area of combinatorial algorithms (e.g. CPM, WABI, RECOMB)
Great coding skills in C and an interest in “algorithm engineering”
A desirable additional factor would be direct experience with DNA sequence assembly algorithms.

My group is at the Max-Planck Institute for Molecular Biology and Genetics in Dresden, Germany. It is primarily an interdisciplinary algorithms/software team with a smaller subgroup that builds novel microscopes. See Myers Lab Homepage.


As an aside, Ewan Birney, who is famous for showing 80% of human genome to be functional (at a cost of $200M) has been following Myers’ new efforts with keen interest. Apparently, Gene Myers plans to assemble a 32Gb genome, which, at 10x size of human genome, could be a $2B opportunity for Birney and ENCODE to look for ‘functional elements’.


Searching and Indexing Genomic Databases via Kernelization

Rayan Chikhi reported about a new algorithm for searching through multiple genomes using LZ77 (Lempel/Ziv) compression. It was presented by S. Puglisi at the Data Structures in Bioinformatics workshop.
Those not attending the conference can check this paper to know what it is about. (h/t: @pmelsted). Readers may recall that Simon Puglisi is an author of another outstanding paper that we reported about a month back (Alex Bowe’s – Variable-Order de Bruijn Graphs).


The rapid advance of DNA sequencing technologies has yielded databases of thousands of genomes. To search and index these databases e ffectively, it is important that we take advantage of the similarity between those genomes. Several authors have recently suggested searching or indexing only one reference genome and the parts of the other genomes where they diff er. In this paper we survey the twenty-year history of this idea and discuss its relation to kernelization in parameterized complexity.

Data Structures in Bioinformatics Workshop

Rayan Chikhi posted web link for a workshop currently taking place in France and the talks appear quite fascinating. Even if you cannot attend, you may be able to go through the papers of the following researchers to see what they are up to.

Simon Puglisi, Univ Helsinki, Finland
Title: Scalable Indexing of Highly Repetitive Data

Abstract: Advances in DNA sequencing mean databases of thousands of human genomes will soon be commonplace. This talk will introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we preprocess the text with LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show this also significantly reduces query times.

Pall Melsted, University of Iceland, Iceland, [email protected]
Title: Bloom filters for fun and profit

Abstract: A Bloom filter is a simple, fast and memory efficient probabilistic data structure for storing a set. The Bloom filter has a one sided error, in the sense that it can give false positive answers to membership, but the error rate is controlled by the amount of memory used. We will review the use of Bloom filters in applications such as k-mer counting, assembly and error correction for HTS datasets. Additionally we will cover some variations of Bloom filters and introduce a new variant with better memory efficiency for storing k-mers.

Guillaume Rizk, INRIA, France, [email protected]
Title: k-mer counting with minimizers

Abstract: K-mer counting is a recurrent step in many NGS algorithms ( assembly, read error correction, …). Current approaches are either ram-intensive (Jellyfish), or disk-based with huge amount of input/output to disk (DSK, KMC). A new disk-based method based on minimizers introduced by Deorowicz et al., reduces the amount of I/O by an order of magnitude, providing both memory frugal and fast k-mer counting. The GATB library (Genome Assembly & Analysis Tool Box now includes an implementation of k-mer counting based on those ideas. This talk will detail the counting procedure, and comparative results.

Nicola Prezza and Alberto Policriti, Univ. Udine, Italy
Title: Hashing and indexing with succinct hash data structures

Abstract: In this talk, we will begin by illustrating how we integrated hashing and indexing in building an aligner (the ERNE aligner) used to map short patterns (reads) on a long text (reference) with a limited amount of errors. Our hash-based strategy builds upon two families of hash functions. The first, named Hamming-aware, permits to considerably reduce search space by hashing the entire Hamming ball around the searched pattern to a smaller one. The second, named de Bruijn, allows to obtain a novel kind of succinct hash data structure whose space requirements are of n+o(n) bits instead of O(n log n). We will conclude showing how we obtained new space-time bounds for the best k-mismatch problem by finding a hash function with both the above described properties.

Johannes Köster and Sven Rahmann, Univ. Essen, Germany
Title: Massively parallel read mapping with the q-group index and PEANUT

Abstract: We present the q-group index, a novel data structure for read mapping tailored towards graphics processing units (GPUs) with a small memory footprint and efficient parallel algorithms for querying and building. On top of the q-group index we introduce PEANUT, a highly parallel GPU-based read mapper.

Bastien Cazaux, Thierry Lecroq and Eric Rivals
Title: From indexing data structures to de Bruijn graphs

Abstract: New technologies have tremendously increased sequencing throughput compared to traditional techniques, thereby complicating DNA assembly. Hence, assembly programs resort to de Bruijn graphs (dBG) of k-mers of short reads to compute a set of long contigs, each being a putative segment of the sequenced molecule. Other types of DNA sequence analysis, as well as preprocessing of the reads for assembly, use classical data structures to index all substrings of the reads. It is thus interesting to exhibit algorithms that directly build a dBG of order k from a pre-existing index, and especially a contracted version of the dBG, where non branching paths are condensed into single nodes. Here, we formalise the relationship between suffix trees/arrays and dBGs, and exhibit linear time algorithms for constructing the full or contracted dBGs. Finally, we provide hints explaining why this bridge between indexes and dBGs enables to dynamically update the order k of the graph.

Gregory Kucherov, Univ. Marne La Vallée, France
Title: Improved filters for the approximate suffix-prefix overlap problem

Abstract: The approximate suffix-prefix overlap problem is to find all pairs of strings from a given set such that a prefix of one string is similar to a suffix of the other. This computation is a basic operation in genome assembly (overlap-layout-consensus approach) and in read correction (multiple sequence alignment approach). Valimaki et al. (Information and Computation, 2012) gave a solution to this problem based on suffix filters. In this work, we propose two improvements to the method of Valimaki et al. that reduce the running time of the computation.
This is a joint work with Dekel Tsur (Ben-Gurion University).

Thierry Lecroq, Univ. Rouen, France
Title: Fast on-line pattern matching algorithm for highly similar sequences

Abstract: With the advent of NGS technologies there are more and more genomic sequences of individuals of the same species available. These sequences only differ by a very small amount. There is thus a strong need for efficient algorithms for performing fast pattern matching in such specific sets of sequences. We will present very efficient algorithms that solves the on-line exact pattern matching problem in a set of highly similar DNA sequences. The algorithm we propose extends the Morris-Pratt and Knuth-Morris-Pratt algorithms and variants of the Boyer-Moore exact string matching algorithm. Experimental results show that our new algorithms exhibit good performances in practice.
This is a joint work with N. Ben Nsira and M. Elloumi.

Solon Pissis, Kings College, London
Title: Accurate and efficient methods to improve multiple circular sequence alignment

Abstract: Multiple sequence alignment is a core computational task in bioinformatics and has been extensively studied over the past decades. This computation requires an implicit assumption on the input data: the left- and right-most position for each sequence is relevant. However, this is not the case for circular structures; for instance, MtDNA. Efforts have been made to address this issue but it is far from being solved. We have very recently introduced a fast algorithm for approximate circular string matching (Barton et al., Algo Mol Biol, 2014). Here we first show how to extend this algorithm for approximate circular dictionary matching; and, then, apply this solution with agglomerative hierarchical clustering to find a sufficiently good rotation for each sequence. Furthermore, we propose an alternative method that is suitable for more divergent sequences. We implemented these methods in BEAR, a programme for improving multiple circular sequence alignment. Experimental results, using real and synthetic data, show the high accuracy and efficiency of these new methods in terms of the inferred likelihood-based phylogenies.
This is a joint work with Carl Barton, Costas S. Iliopoulos, and Fatima Vayani.

Djamal Belazzougui, Univ Helsinki, Finland
Title: BW4SA A sequence analysis library based on the Burrows-Wheeler transform

Abstract: Most sequence analysis problems can be solved with the help of the suffix tree, probably the most important and well-known text indexing data structure. The main problem with the suffix tree is its excessive space usage. Given a text T length n over an alphabet of size sigma a suffix tree occupies O(n log n) bits of space while the text itself occupies only nlog sigma bits of space. The suffix tree can be used to the sequence analysis problems in O(n log sigma) time. The compressed suffix tree (Sadakane TOCS 2007) is a relatively new data structure which uses only O(nlogσ) bits of space, and can simulate any algorithm on the suffix tree at the price of a slowdown by a factor O((logn)ϵ) for any constant ϵ>0. Recently, Beller, Berger, Gog, Ohlenusch and Schnattinger (SPIRE 2011,2012, JDA 2013) presented a general algorithm based on Burrows-Wheeler transform (BWT) that can be used to find maximal repeats and solve other sequence analysis problems in O(nlogsigma) time and bits of space, thus achieving the speed of the suffix tree and the space-efficiency of the compressed suffix tree. The general idea of the new algorithm is to enumerate the nodes of the suffix tree by navigating the suffix link tree (SLT). In a recent paper, Belazzougui, Cunial, Karkkainen and Makinen (ESA 2013) have shown a different algorithm to traverse the SLT the O(nlogσ) time, given the bidirectional BWT of a sequence and use it to solve many other sequence analysis problems. In this talk i will present BW4SA (Burrows-Wheeler for sequence analysis) a sequence analysis library that implements many of the algorithms presented in (ESA 2013).
This is a joint work in progress with Fabio Cunial. Implementation by Topi Paavilainen, Paula Lehtola, Max Sandberg, Lassi Vapaakallio, Jarno Alanko and Ahmed Sobih.

Alice Heliou, Polytechnique, LIX, Paris
Title: Linear-time Computation of Minimal Absent Words Using Suffix Array

Abstract : An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. In this talk we will present the first O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays. Experimental results, using real and synthetic data, show that our implementation outperforms the existing one.
This is a joint work with Carl Barton, Laurent Mouchard and Solon Pissis.

Paola Bonizzoni, Università Degli Studi di Milano-Bicocca, Italy
Title: An external memory BWT-based assembler for Next Generation data

Abstract: The large amount of short read data that are needed for assembling in some fields (e.g. metagenomic) strongly motivates the investigation of disk-based approaches to indexing reads. Positive results in this direction can stimulate the investigation of efficient disk-based algorithms for de novo assembly of reads. We developed a disk based-algorithm for building string graphs and we have integrated our implementation into SGA (the most well-known assembler based string graphs). Besides the integration with an existing assembler, the main advantages of our algorithm are (1) a main memory requirement independent of the size of the data set, and (2) a running time that is competitive with the (memory-based) one of SGA. We have performed an experimental analysis to determine the current limitations of an external memory assembler, starting form the state of the art of building an FM-index and LCP, Suffix Array in external memory.
This is a joint work with Marco Previtali, Università Degli Studi di Milano-Bicocca, Italy.

Guillaume Holley, Univ. Bielefeld, Germany
Title: Bloom Filter Trie – a data structure for pan-genome storage

Abstract: High Throughput Sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species. The concept of pan-genome has therefore emerged. It is composed of two parts: The core genome, which represents the pool of all the genes shared by all the strains of the species, and the dispensable genome, which represents genes not shared by all strains. Graphs exhibit good properties for representing a pan-genome, as they avoid storing redundant information and can represent genomic variants. Unfortunately, existing tools using this representation are slow or memory consuming. We present here a new data structure for storing pan-genomes: The Bloom Filter Trie. This data structure allows to store a pan-genome as an annotated de-Bruijn graph in a memory efficient way.
This is a joint work with Roland Wittler and Jens Stoye from International Research Training Group 1906/1 and Genome Informatics, Faculty of Technology, Bielefeld University, Germany.

Alexander Schoenhuth, CWI, Amsterdam, Holland
Title: Genome sequence assembly via maximal clique enumeration

Abstract: Genomes of unknown ploidy and high diversity, such as those of many pathogens, can harbor haplotypes of (very) low frequency. While one can address this by deep sequencing, the elevated error rates that apply for next-generation sequencing can interfere with the low frequency of the variants inherent to the low-frequency haplotypes. New ideas for error correction and haplotype-aware assembly are required when assembling such highly diverse, polyploid genomes. We present a new method, HaploClique, which is based on enumerating maximal cliques in overlap-like graphs as an underlying principle. In these graphs, nodes reflect probability distribution over sequences, which at the beginning are given by the reads, and edges reflect that two such distributions are likely to have arisen from the same underlying base sequence, i.e. from (locally) the same haplotype. Maximal cliques then reflect maximal ensembles of reads likely to stem from the same haplotype. One then iterates by transforming max-cliques into nodes and drawing the corresponding edges among those new nodes, which results in growing, error-corrected haplotype contigs. As proof of principle, we show that we can reliably assemble HIV genomes – we outperform all prior, state-of-the art methods that aim at global assembly of highly diverse, polyploid genomes.
This is joint work with Tobias Marschall, MPI Informatics, Saarbruecken, Armin Toepfer and Niko Beerenwinkel from ETHZ, Basel.