Burrows-Wheeler Reversible Sorting Algorithm [Original Paper Implementation]
Implementing the 1994 report "A Block Sorting Lossless Data Compression Algorithm"
Quick Intro
LeetArxiv is Leetcode for implementing Arxiv papers and other technical reports.
1.0 Introduction
Burrows-Wheeler is a reversible sorting algorithm. That is, one can go from sorted to unsorted form. The algorithm is central to the BZIP21 compression utility.
In the article below, we looked at BZIP2’s superiority in Elon Musk’s compression challenge:
Burrows-Wheeler was published in the 1994 technical report: A Block Sorting Lossless Data Compression Algorithm (Burrows & Wheeler, 1994)2 and the paper is 24 pages long.
As usual, we code only the important parts of the paper.
2.0 Coding Burrows-Wheeler
The typical Burrows-Wheeler implementation is split into two parts: a forward transformation and a reversible transformation.
*The paper calls the forward transformation the Compression transformation and the reversible transformation a Decompression transformation.
2.1 Forward/Compression Transformation
The forward transform is pretty simple. Here’s a no-fluff summary with an example:
Forward Step in Burrows-Wheeler Example
Step 1: Append a special character at the end of an input array.
*The special character is elucidated in (Langmead, 2000)3
**The original paper opts to send the sorted array as well as its lexicographic index for reversibility.
Let our input be the array [5, 12, 4, 3, 7, 2] of length 6.
Appending a special character modifies the array to [5, 12, 4, 3, 7, 2, $]
The special character must satisfy these properties:
Occurs nowhere else in the text.
Has a value less than all other characters in the array.
Step 2: Rotate the input array by n times. n is the length of the modified array.
n is equal to 7 for the modified array [5, 12, 4, 3, 7, 2, $].
The 7 rotations of our array are :
Step 3: Sort the rotation lexicographically. This means sorting the array of rotations from smallest to largest by comparing their indices.
Step 4: Take the last column as the output.
Output is [2,7,4,12,$, 3, 5]
In Python (Langmead, 2000):
def rotations(t):
""" Return list of rotations of input string t """
tt = t * 2
return [ tt[i:i+len(t)] for i in xrange(0, len(t)) ]
def bwm(t):
""" Return lexicographically sorted list of t’s rotations """
return sorted(rotations(t))
def bwtViaBwm(t):
""" Given T, returns BWT(T) by way of the BWM """
return ''.join(map(lambda x: x[-‐1], bwm(t)))
2.2 Reverse/Decompression Transformation
The reverse algorithm builds a list of predecessor characters. Here’s the paper summary with no-fluff:
Reverse Step in Burrows-Wheeler Example for Output [2,7,4,12,$, 3, 5]
Step 1: Create a table with two columns.
The first column is the sorted output.
The second column is the original output.
Step 2: Use predecessor characters to map the Last to the First character across columns.
Initialize an empty array and place $ to get [$]
$ in first column occurs at index 4 on second column where the value is 5.
Place 5 at the front of our array [5,$].
Find 5 in column 2. Observe that it maps to 12 in column 1.
Place 12 at the front of our array [12, 5,$].
Find 12 in column 2. Observe that it maps to 4 in column 1.
Place 4 at the front of our array [4,12, 5,$].
Find 4 in column 2. Observe that it maps to 3 in column 1.
Place 3 at the front of our array [3,4,12, 5,$].
Find 3 in column 2. Observe that it maps to 7 in column 1.
Place 7 at the front of our array [7,3,4,12, 5,$].
Find 7 in column 2. Observe that it maps to 2 in column 1.
Place 2 at the front of our array [2, 7, 3,4,12, 5,$].
Step 4: Reverse the array to get the original input
Reversal output becomes [$ , 5 , 12 , 4 , 3 , 7 , 2]
This matches the original input [5, 12, 4, 3, 7, 2, $].
In Python, it can be written as
def inverse_bwt(bwt):
if not bwt:
return ""
#Initialize the 2-column table
table = list(zip(bwt, range(len(bwt))))
#Sort to get the first column (F)
F = sorted(table, key=lambda x: (x[0], x[1]))
# Reconstruct the original string
row = bwt.index('$')
original = []
for _ in range(len(bwt) - 1):
char, row = F[row] #Get next character and row
original.append(char)
# Reverse to get original order (since we built it backward)
return original[::-1] + ['$']
Burrows-wheeler is used for data compression. Let me know if you use it anywhere else.
Feedback:murage.kibicho@leetarxiv.com
Footnotes
BZIP2 Homepage. https://sourceware.org/bzip2/
Burrows, M., & Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm (Technical Report No. 124). Digital Equipment Corporation.