Lazy loading of GenBank format

For the past week and a half I have been working on a lazy loading parser for the GenBank format. GenBank is one of the largest and most widely used databases of genetic sequences. High-quality support for its format is important for Biopython, and it is just as important if my lazy loading/indexing feature is to gain adoption. Between the FASTA and GenBank formats I will already be covering a large share of sequence I/O tasks.

Unlike the FASTA format, which is notable for its simplicity and liberal interpretation, the GenBank format is strict and much more verbose. That strictness and verbosity allow a GenBank file to carry highly detailed annotations. Information such as coding regions, complement features, variants, and much more can all be assigned to specific regions of the sequence. This wealth of information is useful to researchers but makes parsing these files more difficult.

The current parser is built as a scanner/consumer pair. The two relevant components are Bio.GenBank.Scanner and Bio.GenBank._FeatureConsumer. In simplified terms, the scanner accepts a file handle, advances the reading head, and identifies segments of information that need to be consumed in specific ways. The consumer accepts strings from the scanner, parses those strings, and adds the prepared data to a SeqRecord object.
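
To make the division of labor concrete, here is a minimal sketch of the scanner/consumer pattern. ToyScanner and ToyConsumer are hypothetical names invented for this post, the record is a plain dict rather than a SeqRecord, and the real Biopython classes handle far more of the format than this does.

```python
import io


class ToyConsumer:
    """Hypothetical consumer: parses strings and stores them on a record."""

    def __init__(self):
        self.record = {"name": None, "definition": None, "sequence": ""}

    def locus(self, text):
        # The first whitespace-separated token after LOCUS is the name.
        self.record["name"] = text.split()[0]

    def definition(self, text):
        self.record["definition"] = text.strip()

    def sequence(self, text):
        # Sequence lines carry a position number and spaces; keep letters only.
        letters = "".join(c for c in text if c.isalpha())
        self.record["sequence"] += letters.upper()


class ToyScanner:
    """Hypothetical scanner: advances through the handle and routes segments."""

    def __init__(self, handle):
        self.handle = handle

    def feed(self, consumer):
        in_sequence = False
        for line in self.handle:
            if line.startswith("//"):
                break  # end-of-record marker
            if in_sequence:
                consumer.sequence(line)
            elif line.startswith("LOCUS"):
                consumer.locus(line[len("LOCUS"):])
            elif line.startswith("DEFINITION"):
                consumer.definition(line[len("DEFINITION"):])
            elif line.startswith("ORIGIN"):
                in_sequence = True
        return consumer.record


# Made-up record, heavily simplified compared with a real GenBank entry.
EXAMPLE = (
    "LOCUS       TOY0001\n"
    "DEFINITION  A toy record.\n"
    "ORIGIN\n"
    "        1 gatcctccat at\n"
    "//\n"
)

record = ToyScanner(io.StringIO(EXAMPLE)).feed(ToyConsumer())
print(record)
```

The scanner never interprets the text it finds and the consumer never touches the file handle, which is what makes it tempting to reuse the scanner for indexing.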

My initial plan was to use the scanner to read the file and build an index at the same time, by checking the handle position between scanner steps. This would have been easy to implement, would have needed fewer read operations, and would have produced more compact code. Frustratingly, this implementation could not work. We have chosen to work with binary file reads to prevent automatic newline correction in Python 3 (which will potentially impact seek performance), and the existing scanner will not accept binary files. An additional complication is that this strategy would necessarily couple the parsing step to the indexing step. That coupling runs counter to my current strategy and would require reimplementing the __init__ function.
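
One way to see why binary reads matter for indexing is to compare what the two modes report for the same bytes. The snippet below is only an illustration with made-up data and in-memory handles, not code from the parser: in binary mode tell() returns plain byte offsets, while a text-mode read with default newline handling turns "\r\n" into "\n", so lengths measured on the decoded lines no longer match positions in the file.

```python
import io

# Two tiny made-up records with Windows-style line endings.
raw = b"LOCUS A\r\n//\r\nLOCUS B\r\n//\r\n"

# Binary read: tell() is a byte count that can be passed straight to seek().
binary = io.BytesIO(raw)
offsets = []
while True:
    position = binary.tell()
    line = binary.readline()
    if not line:
        break
    if line.startswith(b"LOCUS"):
        offsets.append(position)

# Text read with default newline handling: each "\r\n" is translated to "\n",
# so the decoded lines are shorter than the bytes they came from.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii")
text_length = sum(len(line) for line in text)

print(offsets)                # byte offsets of each LOCUS line
print(len(raw), text_length)  # total bytes vs. total decoded characters
```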

My current solution is similar to the one chosen by the Bio.SeqIO._index module: I pre-index the file in one step (or retrieve the index from my SQLite DB, as discussed last week), then use the index to read the file and create a StringIO object that is passed to the relevant functions. Unlike the current index function, the StringIO object will contain only the small subset of the file required to parse the targeted information. So far I am passing basic unit tests for everything but feature parsing.
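
The two-step idea can be sketched roughly as follows. The helper names build_index and load_slice are hypothetical, the index is a plain dict instead of an SQLite table, and it records whole records for simplicity; the same (start, length) pair could just as well point at a single section inside a record.

```python
import io


def build_index(handle):
    """Scan a binary handle once, mapping record name -> (start, length)."""
    index = {}
    name, start = None, 0
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"LOCUS"):
            name, start = line.split()[1].decode("ascii"), position
        elif line.startswith(b"//") and name is not None:
            # Length runs through the end of the "//" terminator line.
            index[name] = (start, handle.tell() - start)
            name = None
    return index


def load_slice(handle, start, length):
    """Read only the indexed bytes and wrap them in a text handle."""
    handle.seek(start)
    return io.StringIO(handle.read(length).decode("ascii"))


# Made-up, heavily simplified records.
data = b"LOCUS A\nDEFINITION first\n//\nLOCUS B\nDEFINITION second\n//\n"
handle = io.BytesIO(data)

index = build_index(handle)
text_handle = load_slice(handle, *index["B"])
print(index)
print(text_handle.read())
```

Because the text handle holds only the requested slice, whatever parses it never sees the rest of the file, and the indexing pass stays separate from the parsing pass.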

