Back to coding: storage of the _index dictionary

After a period of relatively low activity due to my obligations in graduate school and recently due to my presentation at the ASMS conference, I am now doing full-time coding for Biopython! In my previous blog update I mentioned the possibility of saving index information in an XML file in order to quickly load a file index rather than having to parse them every time. After a conversation with Peter and Bow we decided not to use XML and instead to use a SQLite database similar to SeqIO.index_db.

In the intended use case of SQLite storage for the lazy loading and indexing parser, the database file name would be passed to the ‘lazy’ parameter rather than a simple Boolean true. With the database file name, the lazy parser will first connect and verify that the database contains indices for a file. If indices are found, each record will be returned from the database rather than from direct reads to the file. If indices are not found, the file will be indexed fully and written to the database prior to accessing records from the database. As an added convenience, I will probably add a function to do the indexing without invoking SeqRecProxy returns, although this helper function can wait until later.

The database works behind a normal SeqRecProxy object by turning the _index into a property where setting or loading can invoke database calls if specific criteria are met. In this way, the class does not require extensive rewriting to access the database nor does it require external parsers and database readers to initialize values manually. An added benefit to this choice is that in-memory only operation will be as simple as invoking a lazy record without a database file

The current schema seen in the table below defines three database tables primarily for index data and one as a meta data key-value store. This schema is similar to the existing index_db although it adds several relational features not used there. A feature to note is that ‘main_index’ is not fully defined. This table actually relies on the first storage operation of a given file type to guide its creation by pulling dictionary keys and assuming they contain integers. Since each file type may have a variety of properties to track the table will be highly format specific. In the future, I may try to standardize the formats, but that will have to wait until I've added indexing functions for additional formats to see if I can identify the homologous portions. As a final note for those not familiar with SQLite, not all tables have explicit primary keys because each table has a default 'rowid' row which acts as an auto incrementing primary key.

key           TEXT        unique
value         TEXT

fileid        INTEGER     primary key
filename      TEXT        unique
count         INTEGER

fileid        INTEGER     references indexed_files.fileid
featurenumber INTEGER
offsetbegin   INTEGER
offsetend     INTEGER
seqbegin      INTEGER
seqend        INTEGER
meta          TEXT

id            TEXT
fileid        INTEGER     references indexed_files.fileid
...           INTEGER     name from _index
...           ...         name from _index
...           INTEGER     name from _index

Currently I have written all IO operations and SQL queries for the main_index table which means I am also dynamically creating the table (when empty) and doing consistency checking. The only remaining functionality will be setters and loaders for the _features_index collection. This should be easier to implement since all the consistency checking will have executed by the time this property is accessed and the schema will be format independent making query writing faster and getter logic more straightforward.


Popular posts from this blog

Indexing XML files in Python

Fasta performance comparison

Building Biopython on Windows