Back to coding: storage of the _index dictionary
After a period of relatively low activity due to my obligations in graduate school, and more recently my presentation at the ASMS conference, I am now doing full-time coding for Biopython! In my previous blog update I mentioned the possibility of saving index information in an XML file so that a file index could be loaded quickly rather than rebuilt by re-parsing the file every time. After a conversation with Peter and Bow, we decided not to use XML and instead to use a SQLite database similar to SeqIO.index_db.
In the intended use case of SQLite storage for the lazy loading and indexing parser, the database file name would be passed to the ‘lazy’ parameter rather than a simple Boolean True. Given a database file name, the lazy parser will first connect and verify that the database contains indices for the file. If indices are found, each record will be returned from the database rather than from direct reads of the file. If indices are not found, the file will be indexed fully and the indices written to the database before any records are served from it. As an added convenience, I will probably add a function that performs the indexing without creating SeqRecProxy objects, although this helper function can wait until later.
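The dispatch described above can be sketched as follows. This is only an illustration of the decision logic, not Biopython's actual implementation: the function name `open_lazy` and the table layout are assumptions for the example, and a string `lazy` value is treated as an SQLite file name.

```python
import os
import sqlite3


def open_lazy(filename, lazy=True):
    """Illustrative dispatch on the 'lazy' parameter (hypothetical name,
    not Biopython's real API). A string is treated as an SQLite database
    of precomputed indices; True means in-memory indexing only."""
    if isinstance(lazy, str):
        con = sqlite3.connect(lazy)
        # Check whether the database already holds index tables.
        cur = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' "
            "AND name='indexed_files'")
        if cur.fetchone() is None:
            # No index tables yet: the real parser would fully index the
            # file here and store the results before serving records.
            con.execute(
                "CREATE TABLE indexed_files "
                "(fileid INTEGER PRIMARY KEY, filename TEXT UNIQUE, "
                "count INTEGER)")
            con.commit()
        row = con.execute(
            "SELECT fileid FROM indexed_files WHERE filename=?",
            (os.path.abspath(filename),)).fetchone()
        # If `row` is found, records come from the database; otherwise
        # the file would be indexed and written first.
        return con, row
    # Plain lazy=True: index in memory, no database involved.
    return None, None
```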
The database works behind a normal SeqRecProxy object by turning _index into a property whose setter and getter can invoke database calls when specific criteria are met. In this way, the class does not require extensive rewriting to access the database, nor does it require external parsers and database readers to initialize values manually. An added benefit of this choice is that in-memory-only operation is as simple as creating a lazy record without a database file.
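As a minimal sketch of that property pattern (the class and attribute names here are illustrative, not the real proxy class), the getter falls back to a database read only when a connection is present, and otherwise the attribute behaves as plain in-memory state:

```python
import sqlite3


class SeqRecProxySketch:
    """Sketch of turning _index into a property whose getter and setter
    may touch the database; with no connection it is a plain attribute."""

    def __init__(self, con=None, fileid=None):
        self._con = con          # sqlite3 connection, or None for in-memory
        self._fileid = fileid
        self._index_cache = None

    @property
    def _index(self):
        # Criteria for a database call: nothing cached yet and a
        # connection is available.
        if self._index_cache is None and self._con is not None:
            cur = self._con.execute(
                "SELECT * FROM main_index WHERE fileid=?", (self._fileid,))
            row = cur.fetchone()
            if row is not None:
                cols = [d[0] for d in cur.description]
                self._index_cache = dict(zip(cols, row))
        return self._index_cache

    @_index.setter
    def _index(self, value):
        self._index_cache = value
        if self._con is not None:
            # A real setter would also write `value` to main_index here.
            pass
```

External code just reads and writes `record._index` as before; whether those operations hit SQLite is decided inside the property.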
The current schema, shown in the tables below, defines three tables primarily for index data and one as a metadata key-value store. This schema is similar to the existing index_db, although it adds several relational features not used there. One feature to note is that ‘main_index’ is not fully defined. This table relies on the first storage operation for a given file type to guide its creation, by pulling the dictionary keys from _index and assuming the values are integers. Since each file type may have a different set of properties to track, the table will be highly format-specific. In the future I may try to standardize the formats, but that will have to wait until I've added indexing functions for additional formats and can identify the homologous portions. As a final note for those not familiar with SQLite, not all tables have explicit primary keys because each table has a default 'rowid' column which acts as an auto-incrementing primary key.
meta_data
  name           type      notes
  key            TEXT      unique
  value          TEXT

indexed_files
  name           type      notes
  fileid         INTEGER   primary key
  filename       TEXT      unique
  count          INTEGER

features
  name           type      notes
  fileid         INTEGER   references indexed_files.fileid
  featurenumber  INTEGER
  offsetbegin    INTEGER
  offsetend      INTEGER
  seqbegin       INTEGER
  seqend         INTEGER
  meta           TEXT

main_index
  name           type      notes
  id             TEXT
  fileid         INTEGER   references indexed_files.fileid
  ...            INTEGER   name from _index
  ...            INTEGER   name from _index
  ...            INTEGER   name from _index
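The dynamic creation of main_index from a first _index dictionary could look roughly like this. The function name and the example keys are made up for illustration; the point is only that column names come from the dictionary keys and the values are assumed to be integers:

```python
import sqlite3


def create_main_index_table(con, first_index):
    """Sketch: build main_index on the first storage operation for a
    file type, deriving format-specific columns from _index's keys."""
    # The schema assumes every tracked property is an integer.
    for value in first_index.values():
        assert isinstance(value, int), "main_index columns assume integers"
    # Sorted for a deterministic column order.
    columns = ", ".join("%s INTEGER" % key for key in sorted(first_index))
    con.execute(
        "CREATE TABLE IF NOT EXISTS main_index "
        "(id TEXT, fileid INTEGER REFERENCES indexed_files(fileid), %s)"
        % columns)
    con.commit()
```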
Currently I have written all IO operations and SQL queries for the main_index table, which means I am also dynamically creating the table (when empty) and doing consistency checking. The only remaining functionality is the setters and loaders for the _features_index collection. These should be easier to implement, since all the consistency checking will already have executed by the time this property is accessed, and the schema is format-independent, making the queries faster to write and the getter logic more straightforward.
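Because the features table is format-independent, the planned loader reduces to a single query shared by every file type. This is a hedged sketch of functionality not yet written; the function name is hypothetical, and the column names follow the schema above:

```python
import sqlite3


def load_features(con, fileid):
    """Sketch of a _features_index loader: one format-independent query
    returns all feature index rows for a given file."""
    cur = con.execute(
        "SELECT featurenumber, offsetbegin, offsetend, seqbegin, seqend, meta "
        "FROM features WHERE fileid=? ORDER BY featurenumber",
        (fileid,))
    return cur.fetchall()
```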