Parsing GenBank features

For the last week I’ve been working on indexing and lazy loading of GenBank sequence features. Since this is the first feature-annotated format I've converted to lazy loading, I am also extending the lazy-abstract class code to help manage the feature index.

The strategy I found for building the feature index works and is moderately straightforward, but it required a fairly deep rewrite of the parse_features method in the GenBank scanner class. Rather than modify the scanner class, I've used its code as the base for my indexing method, and I now bypass this step inside the scanner entirely. The code has to be modified directly because the file position must be checked after each potentially relevant read operation. When certain criteria are met, the offset for a specific feature is saved. Rather than manually parse each feature’s positional information, I used the captive consumer object to fully parse each feature and then used ‘location.nofuzzy_start’ and ‘location.nofuzzy_end’ to extract the correct feature range. While completely parsing each feature during indexing results in slower indexing, the index database means this operation is done only once. As an added benefit, full feature parsing acts as a file sanity check: database writing is aborted if the file is corrupt or incorrectly formatted.
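To make the offset bookkeeping concrete, here is a minimal sketch of the idea. It is not the scanner code itself: the function name, the sample record, and the simplified rules for spotting a feature key line (five spaces of indent, key text before column 21) are my own assumptions for illustration. It only records where each feature begins and ends in the file and does not parse locations. The handle is read in binary mode so that tell() returns plain byte offsets.

```python
import io

# Hypothetical, heavily trimmed GenBank-style record used only for this sketch.
SAMPLE = (
    b"LOCUS       DEMO                      60 bp    DNA     linear   UNK\n"
    b"FEATURES             Location/Qualifiers\n"
    b"     source          1..60\n"
    b"                     /organism=\"synthetic construct\"\n"
    b"     gene            5..40\n"
    b"                     /gene=\"demoA\"\n"
    b"     CDS             complement(10..30)\n"
    b"                     /gene=\"demoA\"\n"
    b"ORIGIN\n"
    b"//\n"
)


def index_feature_offsets(handle):
    """Return (key, begin_offset, end_offset) for each feature in the table."""
    # Skip ahead to the feature table header.
    while True:
        line = handle.readline()
        if not line:
            return []  # no feature table in this file
        if line.startswith(b"FEATURES"):
            break

    index = []
    current_key = None
    current_begin = None
    while True:
        # Check the file position before every read that might start a feature.
        line_start = handle.tell()
        line = handle.readline()
        # A key line has five spaces of indent followed by the feature key.
        is_key_line = line.startswith(b"     ") and line[5:6].strip() != b""
        # Any non-indented line (or end of file) closes the feature table.
        in_table = line.startswith(b" ")
        if current_key is not None and (is_key_line or not in_table):
            # The previous feature ends where this line begins.
            index.append((current_key, current_begin, line_start))
            current_key = None
        if not in_table:
            break
        if is_key_line:
            current_key = line[5:21].strip().decode("ascii")
            current_begin = line_start
    return index


for key, begin, end in index_feature_offsets(io.BytesIO(SAMPLE)):
    print(key, begin, end)
```

In the real indexer each feature is additionally handed to the consumer so that its start and end positions can be stored alongside these offsets.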

The lazy index is stored as a list of 5-tuples. In each tuple, the first two elements give the feature's start and end on the sequence, relative to the intact (unsliced) SeqRecord. The next two give the beginning and end file offsets of the feature. The last element is largely unused at this stage, but it contains a string qualifier for the feature. In the future this qualifier may be useful for prefiltering features, although that remains to be seen. Using a short tuple for each sequence feature keeps the index small compared to even the simplest fully parsed sequence features.
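A rough sketch of what such an index might look like, and how it could be used to pick out the features relevant to a slice, is shown here. The offsets are made-up values, the string in the last slot is assumed to be the feature key, and entries_within is a hypothetical helper. I am also assuming that a slice keeps only features lying entirely inside it, which is how ordinary SeqRecord slicing behaves.

```python
from typing import List, Tuple

# (seq_start, seq_end, file_begin, file_end, qualifier); values are illustrative.
FeatureIndexEntry = Tuple[int, int, int, int, str]

feature_index: List[FeatureIndexEntry] = [
    (0, 60, 101, 178, "source"),
    (4, 40, 178, 236, "gene"),
    (9, 30, 236, 301, "CDS"),
]


def entries_within(index: List[FeatureIndexEntry], start: int, end: int):
    """Return index entries whose sequence range lies fully inside [start, end)."""
    return [entry for entry in index if entry[0] >= start and entry[1] <= end]


# Only the CDS entry lies completely inside positions 5..35.
print(entries_within(feature_index, 5, 35))
```

Because only these small tuples are held in memory, deciding which features a slice needs involves no file reads at all.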

Consuming the feature index is similar in nature to the feature indexing function. Instead of detailed machinery that steps through the file and identifies feature segments, only the necessary data is read and passed to the consumer apparatus, since each feature's range is already known from the feature index. Many sanity assumptions are made, since the indexer has already established that the record is well formatted and the init class has indicated that the file is unchanged since the previous indexing operation.
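The reading side can be sketched in the same spirit. This is a simplified illustration under my own assumptions: load_raw_features is a hypothetical name, the sample record is invented, and the raw text is simply returned rather than handed to the consumer. The index offsets are computed from the sample bytes so that they are guaranteed to line up.

```python
import io

# Hypothetical, heavily trimmed GenBank-style record used only for this sketch.
SAMPLE = (
    b"FEATURES             Location/Qualifiers\n"
    b"     gene            5..40\n"
    b"                     /gene=\"demoA\"\n"
    b"     CDS             complement(10..30)\n"
    b"                     /gene=\"demoA\"\n"
    b"ORIGIN\n"
)

gene_begin = SAMPLE.index(b"     gene")
cds_begin = SAMPLE.index(b"     CDS")
table_end = SAMPLE.index(b"ORIGIN")

# (seq_start, seq_end, file_begin, file_end, qualifier)
FEATURE_INDEX = [
    (4, 40, gene_begin, cds_begin, "gene"),
    (9, 30, cds_begin, table_end, "CDS"),
]


def load_raw_features(handle, index, start, end):
    """Read the raw text of every indexed feature lying inside [start, end)."""
    loaded = []
    for seq_start, seq_end, file_begin, file_end, tag in index:
        if seq_start >= start and seq_end <= end:
            # No scanning needed: jump straight to the stored byte range.
            handle.seek(file_begin)
            raw = handle.read(file_end - file_begin).decode("ascii")
            loaded.append((tag, raw))
    return loaded


for tag, raw in load_raw_features(io.BytesIO(SAMPLE), FEATURE_INDEX, 5, 35):
    print(tag)
    print(raw, end="")
```

Note that nothing here checks the text it reads; it trusts the index, which is only safe because the indexer has already validated the file.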

