Archives

Categories

Finding us in homolog.us III (Search Algorithm and Indexing)

If you find our commentaries useful, please feel free to press the donate button on the right sidebar, and your generous contribution will be acknowledged in the table at the bottom of the page.

You can follow us on twitter – @homolog_us.

-------------------------------------------------------------------------------------------------------------

Today we shall review the indexing step of search algorithms and explain how Burrows-Wheeler transform-based algorithms (Bowtie, BWT, SOAP2) run significantly faster than alternatives like MAQ. The discussion will remain here at an abstract level, and we shall get into the details in a later commentary.

Almost all search algorithms in bioinformatics have two steps – (i) indexing of genome (reference) and (ii) searching on the indexed genome.

For BLAST, the indexing step is performed through ‘formatdb’ operation (improved and renamed as makeblastdb in BLAST+). Bowtie uses ‘bowtie-build’ command to build the index. Other popular programs like MAQ and BWA also have indexing steps. What does indexing do? Here we present the idea at a conceptual level.

Imagine you have a pile of books and you need to find the one on ‘Homolog.us’. You go from top to bottom of the stack and find the book that you need.

The above task is simple, if one has only ten or twelve books. Few years back, we visited the house of a retired Stanford professor in Palo Alto. Although it was not a very large place, he managed to keep 10,000 books there. Every imaginable place we looked at had boxes full of books. Even the bathroom had a large shelf with over 100 books.

With that many books, he could not have done the simple process of going through the entire pile, whenever he needed a book. People utilize two steps to speed up the search process.

i) All books are sorted in alphabetical order of titles.

See how easy it is to find ‘Homolog.us’ now.

ii) Sorting is not enough, if those 10,000 books are stored in bedroom, bathroom, kitchen, laundry room and basement. A catalog needs to be created to list –
‘A’ books are in bedroom,
‘B’ books are in living room top shelf, etc.

In a bigger pile of books, you can save two letters – ‘AA’, ‘AB’…’TA’..and so on. This is indexing, and it can significantly speed up the search process.

Note that indexing and sorting are two different steps, and one can employ one or both to improve the search process. Even if one indexes an unsorted pile of books, that will result in significant improvement in search time than going through the entire pile of books. This is an important point, because genome sequences cannot be sorted into nucleotides, or one gets a meaningless series of ‘A’s followed by ‘C’s, ‘G’ and ‘T’s. So, the best one can do is index all regions of the genome and store locations of various K-mers. MAQ algorithm performs searches through this strategy.

Why do Burrows Wheeler transform-based algorithms run significantly faster than MAQ? Because they do allow combining of sorting and indexing together. If you remember our earlier commentary, the first column of BW transform gets the sorted genome. That, by itself, is meaningless for searches. However, when it is combined with the last column (BWT of genome), the entire sequence can be reversibly constructed.




-------------------------------------------------------------------------------------------------------------
Heroes and Heroines of New Media--2014

I am strongly influenced by Charles Hugh Smith, who runs his insightful social blog of Two Minds. I hope he will not mind, if I copy his style of acknowledgement to the supporters of our blog.

Our blog is deeply honored by the generous contribution of the following readers. Without their patronage, this site would go away.

Outstandingly Generous:   
Amemiya C. Schnable J. Bowman B. Osipowski P.
Shen M.      

We are also looking for subscribers to get help to finish the tutorials. Please see this post for details.

5 comments to Finding us in homolog.us III (Search Algorithm and Indexing)

  • Nils Homer

    From the MAQ paper: “The algorithm MAQ uses to find the best hit is quite similar to the one used in Eland. It builds multiple hash tables to index the reads and scans the reference sequence against the hash tables to find the hits.”[1]

    So MAQ indexes the reads, not the reference. Of course, if we use the reference multiple times, then indexing the reference once is preferable to indexing the reads, since we have to index each new read dataset. Otherwise, it depends on the two dataset sizes. SHRiMP2 for example indexes the reference, while it predecessor indexes the reads.

    So indexing the reads is worse than indexing the reference, but FM-index algorithms (compressed suffix arrays with BWT) may not be faster than hashes of k-mers and suffix arrays, but an FM-index certainly uses less memory.

    [1] http://genome.cshlp.org/content/18/11/1851.full

  • admin

    You are correct about MAQ indexing reads. Thanks for pointing out. It does make more sense to index the reference that is being used again and again, but maybe MAQ guys were thinking about splitting the reads into small batches and running in parallel.

    I generally agree with your last sentence. ‘Hashing’ is the operative word here. Here is an interesting analogy with the ‘pile of book’ example of the main post. I have 11 books scattered on my desk or in my bag right now. Even though they are not sorted or arranged in any order, I can find out any of them without any pain. Why? I have been using them so often that their locations are ‘hashed’ in my brain.

  • […] us in homolog.us Finding us in homolog.us – part II Finding us in homolog.us III (Search Algorithm and Indexing) Burrows Wheeler transform – Suffix Arrays and FM Index Data Compression […]

  • […] introduced the concept of indexing in a previous post. Today we shall explore indexing for word search […]

  • […] us in homolog.us Finding us in homolog.us – part II Finding us in homolog.us III (Search Algorithm and Indexing) Burrows Wheeler transform – Suffix Arrays and FM Index Data Compression Algorithms Suffix […]

Leave a Reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>