Pre-pull-request performance tests and documentation.

This week I am getting ready to make a pull request for my lazy loading features. While I will continue to commit small fixes and features, I will not be developing at the fast pace of the GSoC from this point forward. For those interested in trying my lazy loading branch, it can be found on my GitHub.

While the lazy loading classes and source files are documented, a good place to start on learning the relatively simple usage pattern is with the tutorial file. I have temporarily posted just the sections written for lazy loading at this Dropbox public link. 

I have done a performance evaluation by using SeqIO.index_db with and without the lazy loading parser on several large data files. The data was produced using my recently updated performance testing script available in this gist. My results are summarized in the graphs below. 

Each column should be taken as an atomic operation, not a step in a cumulative process. Thus the column "read seq" answers the question, "how long does it take to parse a file in order to read the sequence?" The column "read 5% features" answers how long it takes to read the features corresponding to 5 percent of the sequence length. The rest of the columns should be easy enough to figure out. Other than the indexing operation, all times are from operations using a preexisting index.
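To make the "atomic operation" idea concrete, here is a minimal sketch of how independent timings can be collected. It is not my actual testing script: the helper name time_operation, the toy index and the two toy operations are all made up for illustration. The point is only that every measurement starts from fresh state, so no column benefits from work done for another column.

```python
import time


def time_operation(setup, operation, repeats=3):
    """Return the best wall-clock time of operation(setup()) over several runs.

    setup() is called fresh before every run and is not timed, so each
    measurement is independent of the ones before it.
    (Hypothetical helper, written only for this illustration.)
    """
    best = float("inf")
    for _ in range(repeats):
        state = setup()                      # fresh state, untimed
        start = time.perf_counter()
        operation(state)                     # the one operation being measured
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


# Toy stand-in for an already-built index: record name -> raw sequence text.
toy_index = {"rec1": "ACGT" * 1000}

# Each entry plays the role of one column in the graphs.
operations = {
    "read seq": lambda idx: idx["rec1"].upper(),
    "read 5% of seq": lambda idx: idx["rec1"][: len(idx["rec1"]) // 20],
}

timings = {
    name: time_operation(lambda: dict(toy_index), op)
    for name, op in operations.items()
}
```

Because the setup is rerun for every measurement, the numbers in timings can be compared column by column, but they should not be added together.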

Notice that the current parser runs at constant time for all operations, since initializing the record necessarily parses all of the record. Conversely, initializing the lazy parser merely adds a pointer to the bulk of the information and is very fast. For accessing only the features or only the sequence there are significant time savings, but the biggest performance improvement comes when accessing fractional sequences, since read times are now a function of the size of the requested data rather than of the record size. All three plots are for the GenBank format. This format was used because it represents a general worst case scenario for any parser due to the complexity and size of the records.
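The difference between the two approaches can be sketched in a few lines. This is a toy illustration, not the code in my branch: the classes EagerRecord and LazyRecord, the read_seq method and the bytes_read counter are hypothetical names, and the sketch assumes the sequence is stored as one contiguous run of bytes with no line breaks, which real GenBank files do not do.

```python
import io


class EagerRecord:
    """Reads the whole sequence at construction, like a conventional parser."""

    def __init__(self, handle, start, length):
        handle.seek(start)
        self.seq = handle.read(length).decode("ascii")


class LazyRecord:
    """Stores only where the sequence lives and reads bytes on demand."""

    def __init__(self, handle, start, length):
        self._handle = handle
        self._start = start        # file offset of the first base
        self._length = length      # number of bases in the record
        self.bytes_read = 0        # bookkeeping for this demonstration only

    def __len__(self):
        return self._length

    def read_seq(self, begin=0, end=None):
        """Return bases [begin, end) by seeking straight to them."""
        if end is None:
            end = self._length
        begin = max(0, min(begin, self._length))
        end = max(begin, min(end, self._length))
        self._handle.seek(self._start + begin)
        data = self._handle.read(end - begin)
        self.bytes_read += len(data)
        return data.decode("ascii")


# Toy "file": a header line followed by 1000 bases on a single line.
header = b">toy record\n"
sequence = b"ACGT" * 250
handle = io.BytesIO(header + sequence)

lazy = LazyRecord(handle, start=len(header), length=len(sequence))
first_5_percent = lazy.read_seq(0, len(lazy) // 20)   # touches 50 bytes only

eager = EagerRecord(handle, start=len(header), length=len(sequence))
```

Creating the LazyRecord reads nothing from the handle, and the 5 percent request reads only the bytes it needs, whereas the EagerRecord pays for the full record no matter what is asked of it afterwards.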

