New algorithms for short read assembly (categories B and D) often use de Bruijn graphs to store and represent sequence data. What is a de Bruijn graph and why is it so popular for analyzing short read sequences? We will explain the concept here.

De Bruijn graph is an efficient way to represent a sequence in terms of its k-mer components. Although de Bruijn graphs can be used for a broad range of problems, our discussion will be limited to nucleotide sequences. Most papers talk about constructing de Bruijn graphs from short reads and derive the genome sequence from the de Bruijn graph. For simplicity, here we will first introduce de Bruijn graph of a genome, and then explain how short reads fit into the picture.

A de Bruijn graph can be constructed for any sequence, short or long. Here is a simple example –

In the above example, the sequence ATGGAAGTCGCGGAATC is split into overlapping K-mers (K=7), and a directed graph is constructed with those 7-mers as nodes. Edges are drawn between 7-mers adjacent on the original sequence. This method of construction also ensures that connected nodes have overlaps of 6 (=K-1) nucleotides.

The above example is simple, because none of the 7-mers repeated within the original sequence. In the next example, we introduce some repetition. The 5′-most 7-mer in this example is identical to the 3′-most 7-mer (both marked in blue). The de Bruijn graph forms a loop in this case.

Although the nodes displayed in the above examples did not show sequences for both strands, in reality, each node is double-stranded as shown below –

In the above example, 3′-most 7-mer is the reverse complement of the 5′-most 7-mer. The links in the double-stranded de Bruijn graph were constructed accordingly.

The steps shown above can be repeated to construct de Bruijn graph of a large genome of any size. We note that even if a genome has 2 separate chromosomes, the de Bruijn graphs may not remain separate, if those chromosomes have overlapping K-mers as shown below.

All of the above examples considered K=7, but K can be any small or big integer. At its lowest extreme, K can be 1. However, a K=1 de Bruijn graph is not very useful as you can see from the following figure.

The above discussion introduces de Bruijn graphs for our purpose. Why they are so popular for genome assembly programs will be explained in the following post. Before closing, we will note few more points about de Bruijn graphs that will be helpful in our future discussions.

1. Given any sequence and K-mer size, we can create a de Bruijn graph in an unique manner.

2. The other direction is not true. Any de Bruijn graph cannot be resolved into an unique sequence. Unless the de Bruijn graph is in its simplest form, it usually resolves into many possible sequences.

3. Larger the K-mers, more easy it is to convert the de Bruijn graph into an unique sequence.

4. Higher K-mer size generally requires more computer memory to store and process the graph. Therefore, how much RAM a machine has sets the upper limit for value of K.

Additional posts on the same topic –

How do sequencing errors affect de Bruijn graphs?

De Bruijn Graphs for Alternative Splicing and Repetitive Regions

A Drawback of de Bruijn Graph Approach

Please post your comments on the above topic here.

[…] de Bruijn graphs – I de Bruijn graphs – II de Bruijn graphs – III […]

[…] you remember our earlier discussion on de Bruijn graphs, we first constructed de Bruijn graph of a genome and then showed that de Bruijn graph constructed from short reads is nearly identical to the de […]

[…] De Bruijn graphs – I […]

[…] De Bruijn graphs – I […]

[…] Bruijn Graphs – III In earlier commentaries, we introduced the concept of de Bruijn graphs and showed how they were used for de novo assembly of short read sequences. If you read the posts, […]

[…] De Bruijn graphs – I […]

[…] De Bruijn graphs – I De Bruijn graphs – II De Bruijn Graphs – III How do sequencing errors affect de Bruijn graphs? De Bruijn Graphs for Alternative Splicing and Repetitive Regions A Drawback of de Bruijn Graph Approach De Bruijn Graphs for Alternative Splicing and Repetitive Regions Using Mate Pair Information in de Bruijn Graphs Contrail – A de Bruijn Genome Assembler that uses Hadoop Watching a de Brujin Graph Assembler in Action (Contrail) Updating a de Bruijn Assembler Code for Color Space Data in Six Easy Steps […]

[…] us consider the problem of deriving de Bruijn graph of a genome. Suppose you have a large mammalian genome and you like to build a de Bruijn graph from the genome […]

[…] of our earliest commentary on de Bruijn graphs contained the following graph […]

[…] nearly identical chromosomes, but almost all de Bruijn graph-based assemblers discussed previously (De Bruijn graphs – I, De Bruijn graphs – II, De Bruijn Graphs – III, remaining de Bruijn graph series) start with […]

[…] The best way to visualize de Bruijn graph structure is to start from the genome and draw the graph based on k-mers. The only difference with previous examples is that to account for haplotype difference, you need […]

[…] we wrote extensively on the basic working of de Bruijn graph-based assemblers (here, here, here, here, here). First generation assembly algorithms could be easily understood from […]

[…] question is important, you need to first look at how de Bruijn graph-based genome assemblers work (here, here, here). They assemble short reads by breaking them into even shorter pieces, forming a giant […]

[…] did we proceed to understand the code? We have some understanding of de Bruijn graph-based assembly algorithms, and we also looked into the codes of few other assemblers. With that knowledge, we decided to dig […]

[…] De Bruijn graphs – I De Bruijn graphs – II De Bruijn Graphs – III How do sequencing errors affect de Bruijn graphs? De Bruijn Graphs for Alternative Splicing and Repetitive Regions A Drawback of de Bruijn Graph Approach De Bruijn Graphs for Alternative Splicing and Repetitive Regions Using Mate Pair Information in de Bruijn Graphs Contrail – A de Bruijn Genome Assembler that uses Hadoop Watching a de Brujin Graph Assembler in Action (Contrail) Updating a de Bruijn Assembler Code for Color Space Data in Six Easy Steps […]

[…] not need to be restricted to nucleotide sequences. There is one important difference though. In this article, we showed that each node is formed with a K-mer sequence and its reverse complement. Reverse […]