Archives

Categories

Generalizing the Burrows-Wheeler Transform

If you find our commentaries useful, please feel free to press the donate button on the right sidebar, and your generous contribution will be acknowledged in the table at the bottom of the page.

You can follow us on twitter – @homolog_us.

-------------------------------------------------------------------------------------------------------------

We recovered this paper from a ftp site and do not know whether it is published in any journal. Ge Nong , who did very good work on Burrows Wheeler transform, is an author. The text in bold (emphasis ours) caught our attention. Did anyone take a look at Schindler Transform?

The Burrows-Wheeler Transform (BWT) has been long known as a superior reversible text transform method based on block sorting strategy. Among all, the Schindler Transform (ST) is the most remarkable variant of BWT for it dramatically boosts the transform speed using a partial context sorting scheme. The inverse ST currently requires the use of a hash table, which is not indispensable for the inverse BWT. In this paper, we show that both the BWT and the ST can actually be fit into a unified algorithm framework in the sense that not only can the BWT and the ST be performed using a same algorithm, but also can the inverse BWT and the inverse ST be performed similarly without any hash table. Specifically, we prove several theorems and derive from them two efficient algorithms for inverting the ST. The proposed algorithms solve the inverse ST with a time complexity of O(kN) and a space complexity of O(N), where N is the text’s length, k 2 [1; N] is the context order of the transform and is typically very small compared to N in practice. As a result, the algorithms used for the inverse BWT has been successfully generalized to solve the inverse ST.

The relevant paragraph on Schindler Transform is included below.

The Schindler Transform

Obviously, the need of sorting all the contexts up to their full lengths of N is the main cause for the super-linear time complexity of BWT. Inspired by this observation, Schindler proposed11 in [10], [20] a slightly different transform which can sort the texts by using only their first k characters (where k can be a value far less than N), but still render itself reversible. The key idea of ST is a two-hierarchy priority sorting scheme, i.e. the lexicographical sorting criterion in tandem with the positional sorting criterion. To be more specific, let OM be the same matrix as defined for the BWT, using the k-order ST, OM is transformed to Mk by sorting all its rows according to their first k left characters only, i.e. the k-order contexts. In case that any two k-order contexts are equal, the tie is resolved by their positions in the original OM. This two-level sorting scheme can be done using radix sort as follows.

Edit.

We found the reference of original paper.

Capture2

and some more info on Dr. Chan’s research

From Burrows-Wheeler Transform to Suffix Array Construction

Speaker: Dr. Daricks W.H. Chan
Time: 2:00pm-3:00pm, May 8th, 2013
Venue: E202
Abstract:
In 1994, Burrows and Wheeler presented a block-sorting based transform in their technical report [Burrows and Wheeler 1994], known as the BWT, which permutes a text into a new sequence that usually is more ¡°compressible¡±. Briefly, given a text S of N characters, the BWT can be performed in three steps: (1) to derive a matrix consisting of N rotations (cyclic shifts) of S; (2) to sort the rows of the matrix lexicographically; and (3) to extract the last column of the sorted matrix to produce the transformed text. The BWT is lossless and reversible. There exists an efficient algorithm which can correctly restore the original text S from the transformed text with a time/space complexity of O(N). Therefore, when N is large, any naive sorting scheme will still impose an evident bottleneck on the BWT.

One drawback of the BWT in its original form is its time and space complexities. Any naive implementation of the second step that utilizes general purpose comparison-based sorting algorithms could result in a O(N^2 log N) worst case running time complexity. If a traditional radix sort algorithm is used, it requires O(N^2). One solution proposed by Schindler in [Schindler 1997; 2001], known as ST, to speed up the block sorting is to sort only the first k columns of the matrix in order to reduce the complexity to O(kN). A major tradeoff for the ST to achieve the speedup gain over the BWT is that the inverse ST is far more complicated than the inverse BWT. In Schindler’s patent disclosure [Schindler 2001], the hash table based inverse ST algorithms were given only for the orders of 1 and 2. In [Chan and Nong 2006] and [Nong, Zhang and Chan 2011], we show that the time complexity of the inverse ST is independent of k and gave a linear time algorithm to the inverse ST.

Our study was then extended to the construction of suffix array SA(S) of the text S. SA(S) is an array of pointers for all the suffixes in the S sorted in the lexicographically ascending order. It is clear that BWT and the suffix array construction on S is equivalent. In the talk, I will also introduce the development of linear suffix array construction algorithms in the last decade.

CV of Dr. Chan

Dr. Daricks W.H. Chan is now serving the Department of Mathematics and Information Technology of the Hong Kong Institute of Education as an assistant professor. Before joining HKIEd, Dr. Chan taught in the Department of Mathematics of Hong Kong Baptist University, and he had been teaching mathematics subjects in various levels for over ten years. He obtained his doctorate from the HKBU in 2003, and the postgraduate diploma in education from the CUHK in 1999. His areas of expertise are algorithm design, quantum information, graph theory and combinatorics. Dr. Chan has published over fourty papers in international journals including SIAM, IEEE, Physical Review and Discrete Applied Mathematics. He was awarded research and teaching grants of total amount over two million Hong Kong dollars.




-------------------------------------------------------------------------------------------------------------
Heroes and Heroines of New Media--2014

I am strongly influenced by Charles Hugh Smith, who runs his insightful social blog of Two Minds. I hope he will not mind, if I copy his style of acknowledgement to the supporters of our blog.

Our blog is deeply honored by the generous contribution of the following readers. Without their patronage, this site would go away.

Outstandingly Generous:   
Amemiya C. Schnable J. Bowman B. Osipowski P.
Shen M.      

We are also looking for subscribers to get help to finish the tutorials. Please see this post for details.

4 comments to Generalizing the Burrows-Wheeler Transform

  • Rayan

    Good find!
    My impressions are that the ST is very fast to compute but difficult to invert, and yields poorer compression than BWT.

  • samanta

    I do not know whether the patent made its acceptance difficult as well.

  • I think the application of the sort transform to bioinformatics has been more limited by the fact that you can’t build a compressed index from it (or at least no-one has done so yet, at least to my knowledge).

  • samanta

    Thank you Tony. That is very informative !

Leave a Reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>