DNA of a password disaster

Contents

In summary
A few details on Adobe's breach
Suffix arrays
The Burrows-Wheeler Transform and backward search
The FM Index
My FM Index implementation
Rank and select
Wavelet Trees
Bit Vector with O(1) rank and select
Cleaning the Adobe data
Computing the BWT with a little help from Amazon
C++ implementation
Cython wrapper
So what did I discover in the Adobe data?
Appendix: the FM Index and DNA sequencing
References

In summary

Back in October of last year, Adobe owned up to a massive security breach that it had then recently suffered. Identities of a whopping 152,989,508 accounts had been leaked by some nasty hackers. More recently, I came into possession of a file containing this data and poked around with it for a bit. Eventually however I grew tired of waiting for fgrep to finish as I probed the data for substrings of interest.

I thus decided to get a little coding exercise by implementing a version of the FM Index data structure and using it to search the Adobe data. I finally got round to finishing my implementation a month or so ago. The source (written in C++ and wrapped for Python in Cython) is available here. Using this code, I can perform substring searches that previously took over two minutes in a few microseconds (a speed-up factor of nearly \(10^8\) in many cases).

I have written this post in large part for my future self, and the level of detail varies from place to place according to the peculiarities of my ignorance, but I still hope it may be of interest to others. Any feedback will be gratefully received.

A few details on Adobe's breach

A good guide to the data in the Adobe breach is available in the article: Anatomy of a password disaster, hosted on Sophos's Naked Security site. We summarise the relevant details as follows.

A typical record from the leaked data looks something like this:

103238703-|-usually_blank_user_id-|-adobe_breach_example@gmail.com-|-Mai82+/AmfZ=-|-my cat's birthday|--

The three fields of interest in the above are: the email address, the encrypted password and the plaintext password hint.

The Adobe breach is particularly unfortunate for several reasons other than the vast number of accounts affected. Probably the most serious are the following:
  1. The password hints are in plaintext.
  2. The passwords have been encrypted rather than hashed. Thus if somebody discovered the secret key, all passwords would be revealed.
  3. The encryption has not been salted so two passwords are the same if and only if their encrypted versions are the same.
  4. Each consecutive 8-byte block of a password was independently encrypted. It is thus trivial to tell whether two different passwords share an 8-byte-aligned 8-byte block, e.g., if they both have the same first 8 letters (hence this XKCD cartoon).
As many others have noted, the combination of 1 and 3 (or its stronger form 4) makes it incredibly easy to mine the data for passwords. Nevertheless, searching the vast file using general-purpose tools like fgrep is annoyingly slow (in spite of the GNU implementation's clever use of the Boyer-Moore algorithm, as explained by the author here). The situation cries out for a data structure!

Suffix arrays

The fundamental query that we wish to be able to perform is to find a substring. A substring of a string is the same thing as a prefix of some suffix of that string and so it is natural to consider the set of all suffixes. If we sorted these lexicographically we could then quickly find substrings using a binary-search-based approach. Furthermore we would only need to store the indexes of the sorted suffixes. This is exactly the idea of the suffix array. For example, here is the suffix array for the string 'cohomology$', side by side with the corresponding suffixes:

Suffix array and corresponding suffixes for string cohomology$

(The reason for using a string ending in '$' will be explained below.)
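To make this concrete, here is a minimal, self-contained sketch (deliberately naive, and not the code described later in this post): it builds the suffix array by sorting suffix indexes directly, which is \(O(n^2 \log n)\) in the worst case, and then counts occurrences of a pattern with two binary searches.

  #include <algorithm>
  #include <iostream>
  #include <string>
  #include <vector>

  // Build the suffix array of s by sorting suffix indexes lexicographically.
  // Naive construction: fine for illustration, hopeless for 6.6GB of text.
  std::vector<size_t> suffix_array(const std::string & s)
  {
      std::vector<size_t> sa(s.size());
      for (size_t i = 0; i < s.size(); ++i) sa[i] = i;
      std::sort(sa.begin(), sa.end(), [&s](size_t a, size_t b) {
          return s.compare(a, s.size() - a, s, b, s.size() - b) < 0;
      });
      return sa;
  }

  // Count occurrences of p in s: the suffixes with prefix p form a
  // contiguous run of the suffix array, found by binary search.
  size_t count_occurrences(const std::string & s,
                           const std::vector<size_t> & sa,
                           const std::string & p)
  {
      auto lo = std::lower_bound(sa.begin(), sa.end(), p,
          [&s](size_t a, const std::string & pat) {
              return s.compare(a, pat.size(), pat) < 0;
          });
      auto hi = std::upper_bound(lo, sa.end(), p,
          [&s](const std::string & pat, size_t a) {
              return s.compare(a, pat.size(), pat) > 0;
          });
      return hi - lo;
  }

  int main()
  {
      const std::string s = "cohomology$";
      const std::vector<size_t> sa = suffix_array(s);
      std::cout << count_occurrences(s, sa, "o") << "\n";   // prints 4
      std::cout << count_occurrences(s, sa, "omo") << "\n"; // prints 1
  }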

The humble suffix array is extremely useful but it does have its disadvantages. These include:
  1. Space: the array of indexes typically occupies several times as much memory as the text itself (a 4- or 8-byte index per character of text).
  2. Speed: locating a pattern by binary search costs a logarithmic number of string comparisons, each potentially long.

There are various improvements that attempt to deal with these shortcomings but I believe by far the most elegant (and useful) is the FM Index data structure introduced by Ferragina and Manzini in their 2000 FOCS paper: "Opportunistic Data Structures with Applications".

The Burrows-Wheeler Transform and backward search

The idea for the FM Index goes back to Burrows and Wheeler's seminal 1994 Digital (now HP) research report: "A Block-sorting Lossless Data Compression Algorithm". The key idea is to consider the cyclic permutations (aka rotations) of a string instead of the suffixes. Thus instead of the table of suffixes above, we obtain a square matrix:

Suffix array and hypothetical matrix for string cohomology$

We shall henceforth refer to this as the hypothetical matrix. The reason for including the extra '$' character can now be explained: since it is lexicographically smaller than all other characters in the string, the lexicographic order of the suffixes is the same as that of the cyclic permutations. We shall refer to strings terminated with a unique smallest letter as minimally-terminated.

A substring can be defined as a prefix of a cyclic permutation and thus corresponds to a (possibly-empty) set of consecutive rows of the hypothetical matrix. Suppose that for a given string we have the hypothetical matrix to hand and wish to look for a substring. There is a beautifully-simple inductive algorithm for finding the corresponding rows, due to Ferragina and Manzini, known as backward search.

Note first that it is trivial to find a length-1 substring. Since the hypothetical matrix is sorted lexicographically by rows, we just need to find the set of consecutive rows that start with the letter that is our length-1 substring. We can accomplish this in logarithmic time using binary search, but if we suppose that we have precomputed the cumulative letter frequencies in our alphabet, we can do it in constant time. More precisely, define an integer-valued function, \(C\) say, on our alphabet as follows: $$ C[x] = \mbox{number of letters of our string strictly less than $x$} $$ For example, \(C\) would take the following values for our string:

Values of \(C\) function for string cohomology$

Even though it is trivial, we state the result we need formally for emphasis:

Proposition 1 [base case] Suppose we have an ordered alphabet, a distinguished letter \(x\) and a minimally-terminated length-\(n\) string over the alphabet. Using 0-based indexing, the set \(S_x\) of row indexes of the hypothetical matrix corresponding to the prefix \(x\) is the half-open interval: $$ S_x = [C[x], C[x']) $$ where \(x'\) is the next-largest character in the alphabet after \(x\), unless \(x\) is the maximal character, in which case we read \(C[x']\) as \(n\).

The induction step is:

Proposition 2 [induction step] Suppose we have an ordered alphabet, a distinguished letter \(x\), a minimally-terminated string over the alphabet and a substring \(P\) that does not reach the terminating character. Suppose, using 0-based indexing, that the set \(S_P\) of row indexes of the hypothetical matrix corresponding to the prefix \(P\) is the half-open interval: $$ S_P = [lb, ub) $$ then the set \(S_{xP}\) of rows of the hypothetical matrix corresponding to the prefix \(xP\) is the half-open interval: $$ S_{xP} = [lb', ub') $$ where: $$ lb' = C[x] + rank(x, lb-1)\\ ub' = C[x] + rank(x, ub-1) $$ and \(rank(x, i)\) is the number of occurrences of the letter \(x\) in the right-most column of the hypothetical matrix with row index at most \(i\). Furthermore, the above remains correct even if \(lb' \ge ub'\), since this happens iff \(xP\) is not a substring.

This innocent-looking proposition (whose proof we shall outline below) is extremely useful. Perhaps most importantly, it means that although we have been assuming we have the entire hypothetical matrix to hand, we can identify the rows corresponding to a substring using only the data of the left-most column (encoded as \(C\)) and the right-most column. In fact the right-most column is so important that it has its own name.

Definition The right-most column of the hypothetical matrix associated to a minimally-terminated string is known as the Burrows-Wheeler Transform of that string.

For example the Burrows-Wheeler Transform of cohomology$ is y$oooolcmhg. The transform was originally invented as an aid for text compression (it is used by bzip2 for example). As we shall see below, one can recover a string from its Burrows-Wheeler Transform and it tends to rearrange letters in a way that is more friendly to many compression techniques. It can thus provide a useful pre-compression transformation. Unlike the vast majority of text compression ideas, the Burrows-Wheeler transform is global in nature, making it a powerful complementary technique.

It is time to outline a proof of proposition 2.

Consider what we know about the hypothetical matrix under the hypotheses of proposition 2. Assuming for simplicity that \(xP\) is indeed a substring, the hypothetical matrix will look like this:


Shape of hypothetical matrix at one step of backward search

Using the notation of proposition 2 and recalling that the hypothetical matrix is sorted lexicographically by rows, we see that: $$ lb' = C[x] + \mbox{(the number of rows of the form $xQ\ldots$ with $Q < P$ lexicographically)} $$

Key Observation There is a one-to-one correspondence between the rows of the hypothetical matrix of the form \(xQ\ldots\) with \(Q < P\) lexicographically and the rows of the form \(Q\ldots x\) with \(Q < P\) lexicographically. (In fact the correspondence is even order-preserving, though we make no use of this here.)

This follows from the fact that the rows of the hypothetical matrix are simply the set of all cyclic permutations of the given string.

Recalling the definition of the rank function from proposition 2, we complete the argument by noting that, because the hypothetical matrix is sorted lexicographically by rows: $$ rank(x, i-1) = \mbox{the number of rows of the form $Q\ldots x$ with $Q < P$ lexicographically} $$ where \(i\) is the index of the first row beginning with \(P\). We thus arrive at the formula for \(lb'\) stated in proposition 2. A similar argument deals with \(ub'\), and a little case analysis shows that the results hold in the generality stated.
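For concreteness, here is a minimal sketch of backward search over a plain BWT string. The rank function is a deliberately naive linear scan: the whole point of the data structures described in the rest of this post is to replace it with a constant-time equivalent.

  #include <array>
  #include <iostream>
  #include <string>
  #include <utility>

  // rank(x, i): occurrences of x in bwt[0..i] inclusive (i may be -1).
  // Naive linear scan; the real data structure answers this in O(1).
  size_t rank(const std::string & bwt, unsigned char x, long i)
  {
      size_t r = 0;
      for (long k = 0; k <= i; ++k)
          if (static_cast<unsigned char>(bwt[k]) == x) ++r;
      return r;
  }

  // C[x] = number of characters of the text strictly smaller than x.
  std::array<size_t, 256> compute_C(const std::string & bwt)
  {
      std::array<size_t, 256> freq{}, C{};
      for (unsigned char c : bwt) ++freq[c];
      size_t total = 0;
      for (size_t x = 0; x < 256; ++x) { C[x] = total; total += freq[x]; }
      return C;
  }

  // Backward search: the half-open interval [lb, ub) of rows of the
  // hypothetical matrix prefixed by pattern. Starting from [0, n), the
  // first pass through the loop reproduces proposition 1 and each
  // subsequent pass is proposition 2.
  std::pair<long, long> backward_search(const std::string & bwt,
                                        const std::array<size_t, 256> & C,
                                        const std::string & pattern)
  {
      long lb = 0, ub = static_cast<long>(bwt.size());
      for (auto it = pattern.rbegin(); it != pattern.rend() && lb < ub; ++it) {
          const unsigned char x = static_cast<unsigned char>(*it);
          lb = static_cast<long>(C[x] + rank(bwt, x, lb - 1));
          ub = static_cast<long>(C[x] + rank(bwt, x, ub - 1));
      }
      return {lb, ub};
  }

  int main()
  {
      const std::string bwt = "y$oooolcmhg"; // BWT of "cohomology$"
      const auto C = compute_C(bwt);
      const auto range = backward_search(bwt, C, "omo");
      std::cout << range.second - range.first << " occurrence(s)\n"; // 1
  }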

Just before we describe the FM Index data structure that is built around the backward search algorithm, it is worth highlighting this blog post of Alex Bowe. Amongst other things, Bowe has some superb pictures illustrating backward search.

The FM Index

Given some text, the backward search algorithm enables us to find the range of row indexes of the hypothetical matrix corresponding to a given substring. In practice however we almost never have the hypothetical matrix to hand and so we must decide how to turn the range of row indexes into something useful. In the simplest case when only the number of occurrences of the substring is of interest, we can just report the length of the range. Often however, what the user desires will be the set of indexes of where the substring occurs in the text.

If we have the suffix array to hand, the values of the suffix array at the row indexes provided by backward search will be the indexes of where the substring occurs in the text. Thus if we store both the suffix array and the Burrows-Wheeler Transform of the text, we can answer the substring problem very efficiently (we shall see below that it is easy to store the Burrows-Wheeler Transform in such a way that the required rank queries can be performed in constant time with good constant factors). However storing the suffix array does not escape the disadvantage that it usually takes up several times as much space as the text. For this reason, Ferragina and Manzini suggested storing only one in every \(k\) values of the suffix array for some chosen value of \(k\), such that the stored values are evenly spaced in the text. That this sampled suffix array is still sufficient to locate substrings of the text (albeit with a mild time penalty) is a consequence of the following:

Proposition 3 If the \(i\)th letter of the Burrows-Wheeler Transform of a minimally-terminated string \(T\) is the \(m\)th letter of \(T\) (with \(m > 0\)) and we set: $$ x = select(i)\\ j = C[x] + rank(x, i) - 1 $$ where \(select(i)\) is the function that returns the \(i\)th letter of the Burrows-Wheeler Transform, then the \(j\)th letter of the Burrows-Wheeler Transform is the \((m-1)\)th letter of \(T\).

The proof follows easily by reasoning along lines used to establish proposition 2 above.

This extremely useful property of the Burrows-Wheeler Transform means that we can iterate backwards in the text using only its Burrows-Wheeler Transform, and that, provided we can answer both rank and select queries in constant time, each step takes only constant time. (This also shows that we can recover a string from its Burrows-Wheeler Transform, though there are easier ways of seeing this.) In particular we see that the sampled suffix array approach suggested by Ferragina and Manzini enables us to locate substrings with only an additive \(O(k)\) time penalty vs. storing the entire suffix array (iterate backwards until we reach one of the sampled values of the suffix array and then report the value found plus the number of steps taken).
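As a sanity check on proposition 3, here is a small sketch that recovers a minimally-terminated string from its Burrows-Wheeler Transform alone by iterating backwards with the formula above (again with a naive linear-scan rank standing in for the constant-time version).

  #include <array>
  #include <iostream>
  #include <string>

  // Invert a BWT by repeated application of proposition 3: from the row
  // ending with the mth letter of T, jump to the row ending with the
  // (m-1)th letter. Row 0 begins with the terminator '$', so it ends
  // with the last letter of the text proper.
  std::string invert_bwt(const std::string & bwt)
  {
      // C[x] = number of characters strictly smaller than x.
      std::array<size_t, 256> freq{}, C{};
      for (unsigned char c : bwt) ++freq[c];
      size_t total = 0;
      for (size_t x = 0; x < 256; ++x) { C[x] = total; total += freq[x]; }

      std::string text(bwt.size(), '\0');
      size_t i = 0; // row 0 of the hypothetical matrix
      for (size_t pos = bwt.size() - 1; pos-- > 0; ) {
          const unsigned char x = bwt[i];
          text[pos] = x;
          size_t r = 0; // naive rank(x, i): occurrences of x in bwt[0..i]
          for (size_t k = 0; k <= i; ++k)
              if (static_cast<unsigned char>(bwt[k]) == x) ++r;
          i = C[x] + r - 1; // proposition 3
      }
      text.back() = bwt[i]; // the terminator itself
      return text;
  }

  int main()
  {
      std::cout << invert_bwt("y$oooolcmhg") << "\n"; // prints cohomology$
  }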

Finally we note that there is no efficient algorithm for iterating forward in the text using its Burrows-Wheeler Transform. This asymmetry is due to the fact that the lexicographic sorting used to construct the hypothetical matrix is left-to-right. If we were to sort right-to-left, we would have the opposite asymmetry. A moment's thought reveals that this is really just the same as using the Burrows-Wheeler Transform of the reversed text. We will make use of this below.

My FM Index implementation

For the application I had in mind (searching the leaked Adobe data) I did not care about the location of a substring in the text. All I cared about was the context around a given occurrence of a sought-for substring. For example, if I wanted to find all records whose password hint included the string "my cat", I could get an entire record from each occurrence of the substring "my cat" by iterating in the text back to the last new line character, as well as forward to the next.

As we have seen, iterating backward is not a problem, but iterating forward is. There are multiple solutions to this problem but I felt the most natural approach was to store the Burrows-Wheeler Transform of both the text and its reverse. This doubles the memory requirement but gives us the ability to iterate in either direction in the text.

Unfortunately what the above approach gives us is unidirectional iteration in either direction but not bidirectional iteration, which is really what we need. The problem is that locations in the Burrows-Wheeler Transform of the text and its reverse do not naturally correspond. They each give row indexes, but into different hypothetical matrices: the rows are differently permuted. Storing the permutation that lines up indexes into the Burrows-Wheeler Transform of the text and its reverse would be too memory hungry so we need a way to compute this on the fly.

To solve this problem, I took advantage of the fact that, in the ordering of the row indexes obtained by performing backward search on the Burrows-Wheeler Transform of the text, the contexts after each occurrence of the sought-for pattern appear in increasing lexicographic order. The corresponding ordering of the row indexes obtained by performing backward search on the Burrows-Wheeler Transform of the reversed text can thus be obtained by sorting the indexes according to this following context. A most-significant-digit radix sort fits the problem well and usually takes sub-linear time to complete, as sketched below.

Notwithstanding the above, I should point out that when there are large numbers of occurrences of the sought-for pattern, the sorting does become a bottleneck even though it is asymptotically optimal.
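The following is a sketch of the sorting step only, divorced from the index machinery. In the real code the keys are the contexts read off the index rather than materialised strings, but the recursion is the same: bucket by the first character, then recurse into each bucket on the next character.

  #include <iostream>
  #include <string>
  #include <vector>

  // MSD radix sort of byte strings: bucket on the character at depth d,
  // recurse at depth d + 1. Keys exhausted at depth d (bucket 0) are
  // lexicographically smallest and need no further sorting. Work done is
  // proportional to the characters examined, hence often sub-linear in
  // the total length of the keys.
  void msd_radix_sort(std::vector<std::string> & keys, size_t d = 0)
  {
      if (keys.size() < 2) return;
      std::vector<std::vector<std::string>> buckets(257);
      for (std::string & k : keys) {
          const size_t b = (k.size() <= d) ? 0 : 1 + static_cast<unsigned char>(k[d]);
          buckets[b].push_back(std::move(k));
      }
      keys.clear();
      for (std::string & k : buckets[0]) keys.push_back(std::move(k));
      for (size_t b = 1; b < buckets.size(); ++b) {
          msd_radix_sort(buckets[b], d + 1);
          for (std::string & k : buckets[b]) keys.push_back(std::move(k));
      }
  }

  int main()
  {
      std::vector<std::string> contexts =
          {"my catfish", "my cat's", "my cat", "my category"};
      msd_radix_sort(contexts);
      // Prints: my cat, my cat's, my category, my catfish (one per line).
      for (const std::string & c : contexts) std::cout << c << "\n";
  }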

Rank and select

Given a string over a fixed alphabet, there are two complementary queries we often need to perform (and which are critical to the FM Index). These are the select and rank queries.

select takes the index of a position in the string and returns the letter at that location. rank takes the index of a position in the string, as well as a letter from the alphabet, and returns the number of occurrences of that letter in the string up to and including that position.

The following table provides some examples:


rank and select for the string cohomology
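The values in the figure can be checked directly from the definitions; stated as code (a naive version — the rest of this post is about answering these queries in constant time):

  #include <cassert>
  #include <string>

  // Naive rank and select for a plain string, straight from the definitions.
  char select(const std::string & s, size_t i) { return s[i]; }

  size_t rank(const std::string & s, char x, size_t i)
  {
      size_t r = 0;
      for (size_t k = 0; k <= i; ++k)
          if (s[k] == x) ++r;
      return r;
  }

  int main()
  {
      const std::string s = "cohomology";
      assert(select(s, 4) == 'm');  // the letter at position 4
      assert(rank(s, 'o', 5) == 3); // 'o' occurs 3 times in s[0..5]
      assert(rank(s, 'y', 8) == 0); // no 'y' up to position 8
  }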

Wavelet Trees

The Wavelet Tree idea reduces the problem of building a data structure that supports constant-time rank and select queries for strings over a fixed, arbitrary-sized, ordered alphabet to the same problem for strings over an alphabet of size two, i.e., bit vectors.

The catchy name is a bit misleading (there is a very weak analogy with some techniques used to perform the Wavelet Transform) but the idea is extremely appealing, very useful and astonishingly recent. Wavelet Trees were introduced in the SODA 2003 paper of Grossi, Gupta, Vitter, "High-Order Entropy-Compressed Text Indexes".

The easiest way to explain the idea is to start with a picture:

Wavelet Tree for string cohomology

The blue text is present solely for purposes of explanation; the tree stores only the data in red. The red data consist of a bit vector and an ordered alphabet at each node. The tree is defined recursively as follows.

Given a string over an ordered alphabet, construct a bit vector with length equal to that of the string and with 0 in each position where the corresponding letter of the string is in the first half of the alphabet and 1 otherwise. Store this bit vector together with the ordered alphabet. If the alphabet has size greater than two, then construct two new strings: one from the letters of the string in the first half of the alphabet and one from the letters of the string in the second half of the alphabet. Now also store these strings as left and right child Wavelet Trees.

To perform select at a given index, we start by looking at the value of the bit vector in the root node at that index. If it is 0 we recursively ask the same question at the corresponding index in the left subtree, and if it is 1 we go right. We stop when we hit a leaf, at which point the bit at the corresponding index in the leaf's bit vector tells us which element of the alphabet is the result. rank is similarly easy.

A Wavelet Tree over an alphabet of size A has depth at most lg(A) and so rank and select take O(lg(A)) time, assuming the bit vectors support constant-time rank and select. We can regard this as constant time for a fixed alphabet.
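To make the recursion concrete, here is a compact Wavelet Tree sketch. The bit counting inside each node is a naive linear scan for readability; substituting the constant-time bit vectors of the next section is what makes rank and select O(lg(A)).

  #include <algorithm>
  #include <cassert>
  #include <memory>
  #include <string>
  #include <vector>

  // A minimal Wavelet Tree. Each node stores one bit per letter of its
  // string (0 = letter lies in the lower half of the node's alphabet,
  // 1 = upper half); nodes with alphabets of size <= 2 are leaves.
  class WaveletTree
  {
  public:
      static WaveletTree build(const std::string & s)
      {
          std::string alphabet = s; // the sorted, de-duplicated letters of s
          std::sort(alphabet.begin(), alphabet.end());
          alphabet.erase(std::unique(alphabet.begin(), alphabet.end()),
                         alphabet.end());
          return WaveletTree(s, alphabet);
      }

      // rank(c, i): occurrences of c in positions 0..i inclusive.
      size_t rank(char c, size_t i) const
      {
          const bool b = c >= alphabet_[alphabet_.size() / 2];
          const size_t r = count_bits(b, i);
          if (r == 0 || !left_) return r; // at a leaf the bit pins down the letter
          return (b ? right_ : left_)->rank(c, r - 1);
      }

      // select(i): the letter at position i.
      char select(size_t i) const
      {
          const bool b = bits_[i];
          if (!left_) return b ? alphabet_.back() : alphabet_.front();
          return (b ? right_ : left_)->select(count_bits(b, i) - 1);
      }

  private:
      WaveletTree(const std::string & s, const std::string & alphabet)
          : alphabet_(alphabet)
      {
          const size_t mid = alphabet_.size() / 2;
          std::string lower, upper;
          for (char c : s) {
              const bool b = c >= alphabet_[mid];
              bits_.push_back(b);
              (b ? upper : lower) += c;
          }
          if (alphabet_.size() > 2) { // otherwise this node is a leaf
              left_.reset(new WaveletTree(lower, alphabet_.substr(0, mid)));
              right_.reset(new WaveletTree(upper, alphabet_.substr(mid)));
          }
      }

      // Naive: number of positions k <= i with bits_[k] == b.
      size_t count_bits(bool b, size_t i) const
      {
          size_t r = 0;
          for (size_t k = 0; k <= i; ++k)
              if (bits_[k] == b) ++r;
          return r;
      }

      std::string alphabet_;
      std::vector<bool> bits_;
      std::unique_ptr<WaveletTree> left_, right_;
  };

  int main()
  {
      const WaveletTree wt = WaveletTree::build("cohomology");
      assert(wt.select(4) == 'm');  // the letter at position 4
      assert(wt.rank('o', 5) == 3); // three 'o's in positions 0..5
  }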

Note also that the sum of the lengths of all the bit vectors at each level is equal to the length of the original string. Ignoring the usually-negligible space required to store the alphabets and tree structure, the Wavelet Tree thus takes up N lg(A) bits of space for a string of length N. This means that it can provide a little natural compression for a string that was originally represented in terms of a smaller alphabet embedded in a larger one. E.g., a 7-bit ASCII string of length N stored as 8-bit characters will take up 7N bits as a Wavelet Tree instead of the original 8N bits (this turned out to provide a very useful 12.5% space-saving in my case).

Note also that the structure of a Wavelet Tree (i.e., the tree structure, the length of the bit vector at each node and the size of the alphabet at each node) depends only on the frequency of each letter in the string. In other words, the structure of the Wavelet Tree does not change when we permute the letters of a string; in particular it is the same for a string, its Burrows-Wheeler Transform and the Burrows-Wheeler Transform of its reverse.

Indeed the structure of the Wavelet Tree is really just a representation of the information in the function denoted \(C\) in proposition 1 above. This is useful since it means we can calculate \(C\) for free when building a Wavelet Tree to store the Burrows-Wheeler Transform of some text. This little bonus makes the Wavelet Tree fit perfectly with the FM Index data structure.

Bit Vector with O(1) rank and select

When the Burrows-Wheeler Transforms are stored as Wavelet Trees, the FM Index is mostly just a collection of bit vectors. Only two demands are made of these bit vectors: they must be able to support both rank and select queries in constant time.

The C++ std::vector<bool> type is a bit vector with constant-time select but linear-time rank. However we can provide a constant-time rank routine by fixing some number k and storing the precomputed rank at every kth position. Furthermore, if we break these runs of k bits (call them superblocks) into shorter blocks with size equal to a standard word, say 32 bits, then most of the work of the rank function reduces to calculating the Hamming weight (aka popcount) of the blocks lying between the nearest precomputed value and the queried position, and there are fast bit-manipulation techniques for this. (GNU C++ compilers even include a builtin function, __builtin_popcount, for this.) We thus obtain a satisfactory bit-vector type with minimal effort. This simple solution is the strategy currently in use in my implementation, sketched below.
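Here is a minimal sketch of exactly this scheme (the constants — 32-bit blocks, eight blocks per superblock — are illustrative; select on a bit vector is just array access):

  #include <cassert>
  #include <cstdint>
  #include <vector>

  // Bit vector with constant-time rank. Cumulative popcounts are stored
  // at superblock boundaries (here every 8 words = 256 bits); within a
  // superblock we popcount at most 7 full words plus one masked word.
  class RankBitVector
  {
  public:
      explicit RankBitVector(const std::vector<bool> & bits)
          : words_((bits.size() + 31) / 32, 0)
      {
          for (size_t i = 0; i < bits.size(); ++i)
              if (bits[i]) words_[i / 32] |= uint32_t(1) << (i % 32);
          super_.assign(words_.size() / kWords + 1, 0);
          uint64_t running = 0;
          for (size_t w = 0; w < words_.size(); ++w) {
              if (w % kWords == 0) super_[w / kWords] = running;
              running += __builtin_popcount(words_[w]);
          }
      }

      // rank1(i): number of set bits in positions 0..i inclusive.
      uint64_t rank1(size_t i) const
      {
          const size_t w = i / 32;
          uint64_t r = super_[w / kWords];
          for (size_t k = (w / kWords) * kWords; k < w; ++k)
              r += __builtin_popcount(words_[k]);
          const uint32_t mask =
              (i % 32 == 31) ? ~uint32_t(0) : (uint32_t(1) << (i % 32 + 1)) - 1;
          return r + __builtin_popcount(words_[w] & mask);
      }

      bool operator[](size_t i) const { return (words_[i / 32] >> (i % 32)) & 1; }

  private:
      static constexpr size_t kWords = 8; // words per superblock
      std::vector<uint32_t> words_;
      std::vector<uint64_t> super_;
  };

  int main()
  {
      std::vector<bool> bits(1000, false);
      for (size_t i = 0; i < bits.size(); i += 3) bits[i] = true;
      const RankBitVector bv(bits);
      assert(bv[3] && !bv[4]);
      assert(bv.rank1(8) == 3);     // set bits at 0, 3, 6
      assert(bv.rank1(999) == 334); // set bits at 0, 3, ..., 999
  }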

It is not difficult to improve significantly on the solution outlined above. There is a very simple compressed bit vector data structure described in the 2007 ACM Transactions on Algorithms paper of Raman, Raman and Rao, "Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets". The bit vector data structure they describe is also based around a superblock-block two-level bucketing idea and provides constant-time rank and select. It is very fast in practice and is easy to implement. A good outline is provided in this blog post of Alex Bowe.

Since the bit vector is the work horse of the FM Index, any improvement here makes a big difference. I expect to implement this compressed bit vector structure on one of my next rainy afternoons.

Cleaning the Adobe data

The raw data that was leaked from Adobe was packaged as a single file, a little over 9GB in size. It was a mostly-text file with one record per line (most of the time). The data needed a little cleaning so I performed the following steps:
  1. dropped the first two fields of each record (the numeric id and the usually-blank user id);
  2. replaced the '-|-' field separators with tab characters;
  3. stripped the trailing '|--' from each record;
  4. fixed up the handful of malformed lines.
For example, a record that looks like this:

103238703-|-usually_blank_user_id-|-adobe_breach_example@gmail.com-|-Mai82+/AmfZ=-|-my cat's birthday|--

in the original data, becomes this:

adobe_breach_example@gmail.com\tMai82+/AmfZ=\tmy cat's birthday

after cleaning. As a result of these operations, the data shrank to 6.6GB and the new file contained exactly 152,989,508 lines (one record per line) in happy agreement with the last line of the original file. I decided to leave the passwords encoded in base 64 to avoid having to use non-printable characters.
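A minimal sketch of the per-record transformation might look like the following (this is not the exact script I used, and it quietly skips any line that does not have exactly five fields rather than fixing it up):

  #include <iostream>
  #include <string>

  // Convert raw records on stdin to the cleaned tab-separated form:
  // strip the trailing "|--", drop the first two fields, and join the
  // remaining three (email, password, hint) with tabs.
  int main()
  {
      const std::string sep = "-|-";
      std::string line;
      while (std::getline(std::cin, line)) {
          if (line.size() >= 3 && line.compare(line.size() - 3, 3, "|--") == 0)
              line.erase(line.size() - 3);
          std::string out;
          size_t field = 0, pos = 0;
          while (pos <= line.size()) {
              const size_t next = line.find(sep, pos);
              const size_t end = (next == std::string::npos) ? line.size() : next;
              if (field >= 2) {
                  if (field > 2) out += '\t';
                  out.append(line, pos, end - pos);
              }
              if (next == std::string::npos) break;
              pos = next + sep.size();
              ++field;
          }
          if (field == 4) std::cout << out << '\n'; // exactly five fields seen
      }
  }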

Of these 152,989,508 records, 22,663,815 of them had an empty password field. Thus the number of accounts whose encrypted password was leaked was 130,325,693.

Computing the BWT with a little help from Amazon

The Burrows-Wheeler Transform is sufficiently important that a number of algorithms exist for its computation. As far as I could tell, all the acceptably-fast algorithms are space-hungry because they temporarily store the entire suffix array. I did not quite have the appetite to implement an efficient Burrows-Wheeler Transform routine so I looked around and discovered that the name of the game here is Yuta Mori.

In 2008/9 Mori released OpenBWT 1.5 under an MIT license. I found it easy to carve out what I needed from OpenBWT and did so. I later discovered that the current state of the art seems to be Mori's more recent SAIS but in any case, OpenBWT was sufficient.

I was faced with a problem however: I wished to compute the Burrows-Wheeler Transform of a string of length 6,621,386,639 (the size of the cleaned Adobe data). Since the time-efficient BWT algorithms (including OpenBWT) temporarily store the suffix array, and since this would evidently need to be an array of 64-bit indexes, I would need 6.6 x 8 = 52.8GB of memory in addition to the 6.6GB for the original data. Sadly my laptop has a mere 16GB of RAM. The obvious way around this problem would be to chop the data up into blocks (like bzip2 does) and so effectively to store the data as several instances of the FM Index. I wanted to avoid this, and other solutions would require a more significant coding effort than I found appealing. Grr.

Pondering what to do, I suddenly remembered I had been dying for an excuse to try out Amazon's cloud computing services. So I stumped up and paid for a few hours on one of Amazon's EC2 instances with a whopping 117GB of RAM (far more than I needed). After making trivial changes to Mori's 32-bit OpenBWT code so that it used 64-bit indexes, I tried the job on an EC2 instance. Incredibly, it worked first time!

Finally I should say that because I wanted the Burrows-Wheeler Transform of both the Adobe data and of its reverse, I in fact ran two instances of the 64-bit OpenBWT code. It occurred to me that there are some close connections between the Burrows-Wheeler Transform of a text and its reverse so it should be possible to do better than blindly computing twice from scratch. Indeed this is the case. Although I did not put it to use, the following recent paper explains how one can use the Burrows-Wheeler Transform of some text and its suffix array to compute the Burrows-Wheeler Transform of its reverse more quickly than otherwise: Ohlebusch, Beller, Abouelhoda, "Computing the Burrows-Wheeler transform of a string and its reverse in parallel", JDA (2013).

C++ implementation

The essence of my FM Index implementation looks like this:

  class FMIndex
  {
  public:
      FMIndex(const std::string & s);

      FMIndex(std::istreambuf_iterator<char> serial_data);

      void serialize(std::ostreambuf_iterator<char> serial_data) const;

      // Inserts one pair of iterators (one forward, one reverse) into
      // matches for each occurrence of pattern in the indexed text.
      size_t find(std::list<std::pair<const_iterator, const_reverse_iterator>> & matches,
                  const std::string & pattern,
                  const size_t max_context = 100) const;

      std::list<std::string> find_lines(const std::string & pattern,
                                        const char new_line_char = '\n',
                                        const size_t max_context = 100) const;
  };
  
Construction couldn't be simpler: you just provide a string! It does however take some time to construct an FM Index: well over two hours for the Adobe data (mostly the time taken to compute the Burrows-Wheeler Transforms on EC2). For this reason I wrote the basic serialization method and deserialization constructor shown above. I can deserialize the Adobe data in about 50 seconds (and this should become 20 seconds if I implement the compressed bit vector structure mentioned above).

Finally there is the all-important find method. This populates a user-supplied list with pairs of iterators (one forward, one reverse), inserting one pair of iterators for each occurrence of the sought-for pattern in the text used to construct the FM Index. As mentioned above, this method is very fast. The iterators allow the user to step forward or backward in the text around each match. For the (cleaned) Adobe data, since records correspond to lines, all I wanted was to find the lines containing each occurrence of the pattern. The find_lines method provides this functionality.
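Putting the pieces together, usage is along these lines (a sketch: the file name is invented and the toy text stands in for the cleaned Adobe data):

  #include <fstream>
  #include <iostream>
  #include <iterator>
  #include <string>

  #include "FMIndex.h"

  int main()
  {
      // Build an index over a toy text and save it for next time.
      const FMIndex index(std::string("first line\nsecond line\n"));
      std::ofstream out("example.fm", std::ios::binary);
      index.serialize(std::ostreambuf_iterator<char>(out));

      // Report every line containing the pattern.
      for (const std::string & line : index.find_lines("line"))
          std::cout << line << "\n";
  }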

Cython wrapper

C++ is wonderfully fast (to run) but I wanted to be able to play with the Adobe data interactively. I needed to reach for another tool and Python was the obvious choice, though I had not previously written my own Python module. After looking into it a little, I weighed the plausible ways of presenting my C++ code to Python; Cython was the clear winner and I was delighted with it. The following Cython code, when cythonized, turns my C++ into code ready for compilation as a Python module:

  # distutils: language = c++
  # distutils: include_dirs = ../FM-Index ../openbwt-v1.5
  # distutils: sources = ../FM-Index/FMIndex.cpp ../FM-Index/WaveletTree.cpp ../FM-Index/BitVector.cpp ../openbwt-v1.5/BWT.c

  from libcpp.list cimport list
  from libcpp.string cimport string

  cdef extern from "FMIndex.h":
      cdef cppclass FMIndex:
          FMIndex(string) except +
          list[string] find_lines(string)

  cdef extern from "FMIndex.h" namespace "FMIndex": # static member function hack
      FMIndex * new_from_serialized_file(string)

  cdef class PyFMIndex:
      cdef FMIndex * thisptr
      def __cinit__(self, s):
          self.thisptr = new FMIndex(s)
      def __dealloc__(self):
          del self.thisptr
      def find_lines(self, pattern):
          return self.thisptr.find_lines(pattern)
      def new_from_serialized_file(self, filename):
          del self.thisptr
          self.thisptr = new_from_serialized_file(filename)
  

So what did I discover in the Adobe data?

Sadly, I do not think it would be ethical to make public most of the little discoveries I made. In particular, I am not willing to share a password that might compromise anybody's account.

I can share some summary data and a few harmless scraps though. Of the 130,325,693 encrypted passwords leaked, 44,084,634 were unique. Common passwords are easy to find and other people have already played this game. Indeed the top 100 most common passwords are listed here.

Amongst the top 100 passwords, "soccer" is the highest-ranking sport on the list ("liverpool" is the only Premiership team), "monkey" is the only animal present and "superman" is the only comic-book hero mentioned.

As regards my own discoveries, there are some things I noticed which can, I think, be revealed without breaching my taxing ethical standards. So here is a list of harmless, but hopefully amusing, observations.

The above list took about 10 seconds to collect when playing with the FM Index in a Python shell (the running time was mere microseconds). It is frighteningly-easy to mine the data with these tools.

Appendix: the FM Index and DNA sequencing

I have relegated this section to an appendix as it is a bit of a detour and is rather sketchy. Nevertheless, I find the application very interesting.

By now there are a number of variants of the FM Index. These appear to have been driven mostly by the desire to be able to locate approximate substrings. The most important application of the FM Index is currently to the short-read mapping problem in DNA sequencing. This problem is almost exactly the problem of finding the location of a large number of relatively short substrings (DNA fragments called short reads) in a fixed long string (a reference genome sequence). It is important to be able to locate approximate DNA fragments in a reference genome because short reads may contain sequencing errors or (more interestingly) there may be DNA mutations.

Below is a brief outline of some FM Index variants. The alphabet-friendly FM-index of Ferragina, Manzini, Mäkinen and Navarro combines Wavelet Trees with compressed bit vectors to tame the alphabet-dependence of the original structure. The bi-directional BWT of Lam et al. indexes the text together with its reverse so that a pattern can be extended in either direction. Li and Durbin's BWA adapts backward search to tolerate a bounded number of mismatches, which is exactly what approximate matching of short reads requires.

It seems the data structure has yet to find its final form but is converging on it. In my opinion, any ideas that make it easier to link the indexes in the Burrows-Wheeler Transform of the text and its reverse would be particularly useful and there is likely to be room for progress from the present state of knowledge.

In terms of implementations that solve the short-read mapping problem, Bowtie seems to be the most-often-cited tool, but 2BWT and BWA/MAQ are also frequently mentioned. These are all FM-Index-based. In addition there are tools like SOAPv2, Zoom and BFAST (an interesting approach that uses the Smith-Waterman algorithm) but I am not sure where these fit in. No doubt there are many other tools. From a practical point of view, one of the things I found exciting was that there are already several FPGA implementations of some of these algorithms, so it seems speed really is critical.

At present it seems that even though the FM Index is not ideally suited to the problem of locating approximate substrings, it remains a highly-valued tool.

The whole area looks exciting and ripe for contributions!

References

  1. Burrows, Wheeler, "A Block-sorting Lossless Data Compression Algorithm", Digital Systems Research Center, Research Report 124, (1994).
  2. Ferragina, Manzini, "Opportunistic Data Structures with Applications", Foundations of Computer Science, Proceedings of 41st Annual Symposium (2000).
  3. Ferragina, Manzini, Mäkinen, Navarro, "An Alphabet-Friendly FM-index", Lecture Notes in Computer Science 3246, in Proc. SPIRE 2004 (2004).
  4. Grossi, Gupta, Vitter, "High-Order Entropy-Compressed Text Indexes" SODA 2003, Proceedings of 14th annual ACM-SIAM symposium on discrete algorithms (2003).
  5. Lam, Li, Tam, Wong, Wu, Yiu, "High Throughput Short Read Alignment via Bi-directional BWT", 2009 IEEE International Conference on Bioinformatics and Biomedicine.
  6. Li, Durbin, "Fast and accurate short read alignment with Burrows-Wheeler transform", Bioinformatics, Vol. 25, no. 14 (2009).
  7. Ohlebusch, Beller, Abouelhoda, "Computing the Burrows-Wheeler transform of a string and its reverse in parallel", Journal of Discrete Algorithms (in press, 2013).
  8. Raman, Raman, Rao, "Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets", ACM Transactions on Algorithms, Vol. 3, No. 4 (2007).