Creating a faster container data structure

The analysis of a biological sequence often requires parsing and processing richly annotated metadata. Protein sequence data includes attributes such as active sites, sequence variants, post-translational modifications, secondary structures, and domains; these annotations range in length from a single amino acid to hundreds. Genetic sequence data carries an equally diverse range of annotations, including the coding DNA sequence, exons, repeats, single nucleotide polymorphisms, and many others, ranging in size from a single nucleotide to millions of residues.

Given a sequence [S1, S2) and an annotation covering [A1, A2), we can say that the sequence intersects the annotation in several cases. (1) The beginning of the annotation falls within the sequence region, which also covers the case where the entire annotation is bounded by it; both pass test 1 below. (2) The end of the annotation falls within the sequence region. (3) The sequence is entirely bounded by the annotated region.

  1. S1 ≤ A1 < S2
  2. S1 < A2 ≤ S2
  3. A1 < S1 and S2 < A2
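For half-open intervals, the three tests also collapse into a single comparison: the regions intersect exactly when A1 < S2 and S1 < A2. A minimal sketch in Python (the function name is my own):

    def intersects(s1, s2, a1, a2):
        # Tests 1-3 from above, for half-open regions [s1, s2) and [a1, a2).
        # Together they are equivalent to: a1 < s2 and s1 < a2
        return (s1 <= a1 < s2) or (s1 < a2 <= s2) or (a1 < s1 and s2 < a2)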

While one would be tempted to reach for a binary search for fast retrieval of this data, it cannot solve the problem on its own. One could construct two arrays of annotation data, one sorted by begin point and the other sorted by end point, allowing binary searches to test criteria (1) and (2). The third criterion, however, would force testing of additional data, eventually dragging performance back down to the level of brute force. To efficiently search for matching intervals, a data structure called an interval tree is required.

Without going deeply into the details of the implementation, I can say briefly that multiple sets of bins are constructed, each set of bins spanning the entire annotated region. Some sets have a few large bins while others have many small ones. Parsed data is then added to the smallest bin that will fully contain the represented region. When searching for intersecting regions, only the bins containing potential matches are searched, leaving a large portion of the data untouched. The figure below shows an example of the sort of binning routine I implemented, and a code sketch of the scheme follows it.

___Figure 1_________________________________________________
|                                                           |
|    feature1  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~              |
|    feature2  |       ~~~~                  |              |
|    feature3  |       |  |               ~~~~~~~~~~~~~     |
|              |       |  |               |  |        |     |
| bins:        |       |  |               |  |        |     |
|    0_________|_______|__|_______._______|__|____.___|_    |
|    1_________________|__|__   2_:_______|_______:___|_    |
|    3__________  4____|__|__   5_:________  6____:_____    |
|                                 :               :         |
|                                 :               :         |
|    query1                       [ ? ? ? ? ? ? ? ]         |
|...........................................................|
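To make the scheme concrete, here is a minimal sketch of this kind of multi-resolution binning. The class name, the number of levels, and the power-of-two bin widths are illustrative choices of mine, not the original code:

    import math

    class BinIndex:
        def __init__(self, region_length, levels=6):
            self.levels = levels
            # Bin width at each level: level 0 is one bin covering the whole
            # region, and each deeper level halves the bin width.
            self.widths = [math.ceil(region_length / 2 ** k) for k in range(levels)]
            # One dict per level mapping bin number -> list of (begin, end) pairs.
            self.bins = [{} for _ in range(levels)]

        def add(self, begin, end):
            # Walk from the finest level upward and stop at the first level
            # where [begin, end) fits entirely inside a single bin.
            for level in reversed(range(self.levels)):
                width = self.widths[level]
                if begin // width == (end - 1) // width:
                    self.bins[level].setdefault(begin // width, []).append((begin, end))
                    return

        def query(self, begin, end):
            # Only bins overlapping [begin, end) can hold matches; the rest
            # of the stored data is never touched.
            for level in range(self.levels):
                width = self.widths[level]
                for b in range(begin // width, (end - 1) // width + 1):
                    for a1, a2 in self.bins[level].get(b, ()):
                        if a1 < end and begin < a2:  # half-open intersection
                            yield (a1, a2)

Because every interval stored in a bin is fully contained in that bin's span, a bin that does not overlap the query can be skipped without examining its contents.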

Implementation of a new data structure is useless without some basis of comparison. Once I was satisfied with my implementation, I wrote an additional annotation storage and search routine to compare against. Rather than using the interval tree, it stored annotations as a flat list sorted by begin index. A search was done by iterating through the annotations until the begin index of an annotation was larger than the end index of the search range.
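A sketch of that baseline as I have described it, again with illustrative naming:

    class FlatStore:
        def __init__(self, annotations):
            # Flat list of (begin, end) pairs, sorted by begin index.
            self.items = sorted(annotations)

        def query(self, begin, end):
            results = []
            for a1, a2 in self.items:
                if a1 >= end:
                    # Every later annotation begins even further right,
                    # so the scan can stop here.
                    break
                if begin < a2:  # together with a1 < end, an intersection
                    results.append((a1, a2))
            return results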

With the code ready, I tested each routine on the same large set of randomly generated data, and what I found was unexpected: the simple class I had written as a comparison was much faster. Its construction was about five times faster, and its search was dead even in speed. Going back to the code, I quickly found a few optimizations. First, I had not sorted my binned data; by sorting the bins I could truncate the search in some cases, much like the one-dimensional list. The second big speed-up came from the observation that every annotation inside a bin fully bounded by the query interval must satisfy the intersection tests, so much of the data could be added to the results without any comparison operations. Once these speed-ups were implemented, the interval tree search performed between two and four times faster, depending on the character of the stored data. An unsurprising trend is that the interval tree performs best when storing a high concentration of short intervals.
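Both optimizations amount to a small change in the query loop of the sketch above. A hedged illustration, assuming each bin's list is kept sorted by begin index (for example with bisect.insort at insertion time); the function name is mine:

    def query_sorted(index, begin, end):
        # index: a BinIndex whose per-bin lists are sorted by begin index.
        results = []
        for level in range(index.levels):
            width = index.widths[level]
            for b in range(begin // width, (end - 1) // width + 1):
                items = index.bins[level].get(b, ())
                # Speed-up 2: a bin lying entirely inside the query can only
                # hold intersecting intervals, so take them all untested.
                if begin <= b * width and (b + 1) * width <= end:
                    results.extend(items)
                    continue
                # Speed-up 1: the bin is sorted by begin index, so the scan
                # can stop once an interval starts at or beyond the query end.
                for a1, a2 in items:
                    if a1 >= end:
                        break
                    if begin < a2:
                        results.append((a1, a2))
        return results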

Currently this tree is stored in memory, and my planned implementation would discard it at the end of an analysis. Peter suggested developing some sort of disk-based storage format in order to save the index information. This is an idea I am taking seriously in light of the slow performance of building an index, and the undoubtedly slower performance of the medium-comprehension parsing necessary to index a file.
