Posts

Showing posts from August, 2014

Fasta performance comparison

I wanted to add performance metrics for fasta format files because their simplicity means they are often used to transmit large sequences, which is the main intended use case of the lazy loading parsers. Fasta files don't contain extensive annotations or any features, so the primary task of parsing a fasta file is parsing the sequence. Included below are the compared times for both a large file and a medium-sized file.

Here we can see that the lazy loader's sequence reading is about 20% slower when the full range is read. As with genbank files, lazy loading is much faster when slices are used to access only a portion of a record. Reading five percent of the sequence results in the expected 95% time savings when using the lazy loading parser. The results posted here and those in my previous blog post were produced using Python 3.4.
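To illustrate why reading a slice can be so much cheaper than reading the full record, here is a minimal sketch of offset-based slicing. It is not the code from my branch: the function name `read_slice` and its parameters are hypothetical, and it assumes the sequence is wrapped at a fixed line width with single-byte `\n` line endings and that the byte offset of the first residue is already known from an index.

```python
import io


def read_slice(handle, seq_start, line_len, start, end):
    """Return sequence[start:end] by seeking instead of reading the whole record.

    handle    -- binary file handle (hypothetical helper, not a Biopython API)
    seq_start -- byte offset of the first residue of the record
    line_len  -- residues per line (assumes every full line has this width)
    """
    # Each full line before a residue contributes one extra newline byte.
    first = seq_start + start + start // line_len
    last = seq_start + end + end // line_len
    handle.seek(first)
    chunk = handle.read(last - first)
    # Drop the newlines that fall inside the requested range.
    return chunk.replace(b"\n", b"").decode("ascii")


# A tiny in-memory record standing in for a large file.
DEMO_HEADER = b">seq1 demo\n"
DEMO_FASTA = DEMO_HEADER + b"ACGTACGTAC\nGGGGCCCCTT\nAAAA\n"
DEMO_SEQUENCE = "ACGTACGTACGGGGCCCCTTAAAA"

demo_handle = io.BytesIO(DEMO_FASTA)
demo_slice = read_slice(demo_handle, len(DEMO_HEADER), 10, 8, 13)
```

Only the bytes covering the requested range are read, so the cost scales with the size of the slice rather than the size of the record.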

Pre-pull-request performance tests and documentation.

This week I am getting ready to make a pull request for my lazy loading features. While I will continue to commit small fixes and features, I am not going to be developing at the fast pace of the GSoC from this point forward. For those interested in trying my lazy loading branch, it can be found on my GitHub.
While the lazy loading classes and source files are documented, a good place to start learning the relatively simple usage pattern is the tutorial file. I have temporarily posted just the sections written for lazy loading at this Dropbox public link.
I have done a performance evaluation by using SeqIO.index_db with and without the lazy loading parser on several large data files. The data was produced using my recently updated performance testing script, available in this gist. My results are summarized in the graphs below.

Each column should be taken as an atomic operation, not a step in a cumulative process. Thus the column "read seq" answers the question, "…

Preparing the lazy loading parser for public review

This last week has been very busy for me. My recent goal has been to tie up loose ends and implement important features that make integration with Biopython more complete and transparent. Largely this was inspired by a desire to start documenting my features. Larry Wall once said that laziness is a virtue of great programmers, and I hope to embody that virtue by making my parser so similar to the existing index_db parser that I can lean extensively on existing documentation. I'm sure that the other lazy programmers out there will be happy to avoid reading a small novel about my new features.

So far my activities have included:
- Writing a file-handle cache that limits the number of open files
- Writing a file wrapper that exposes the cache to the lazy class
- Writing a scheme to use multiple files (as required by index_db)
- Integrating the lazy parsers with SeqIO's read, parse, and index_db
- Adding support for key modifying functions
- Reorganizing and extending the tests
- Adding features to match the docum…
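The idea behind a file-handle cache that limits the number of open files can be sketched as follows. This is an illustration only, not the class from my branch: the name `FileHandleCache`, its methods, and the least-recently-used eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict


class FileHandleCache:
    """Keep at most max_open files open (hypothetical sketch).

    When the limit is reached, the least recently used handle is closed
    before a new file is opened.
    """

    def __init__(self, max_open=10):
        self.max_open = max_open
        self._handles = OrderedDict()  # path -> open binary handle

    def get(self, path):
        """Return an open handle for path, reusing a cached one if possible."""
        handle = self._handles.pop(path, None)
        if handle is None:
            if len(self._handles) >= self.max_open:
                # Evict the entry that has gone unused the longest.
                _, oldest = self._handles.popitem(last=False)
                oldest.close()
            handle = open(path, "rb")
        # Re-inserting puts the handle in the most recently used position.
        self._handles[path] = handle
        return handle

    def __len__(self):
        return len(self._handles)

    def close(self):
        """Close every cached handle."""
        while self._handles:
            _, handle = self._handles.popitem()
            handle.close()
```

A wrapper around such a cache can call `get(path)` each time a record needs to be read, so a lazy record never has to hold its own file open.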

Lazy loading of Uniprot-XML

We are closing in on the pencils down date for Google Summer of Code, but I'm still busy cleaning up my code and adding the last of my features. Previously I developed a method of indexing XML files using Python's expat bindings. This helped tremendously with the task of indexing and lazy-loading Uniprot-XML files. As always, I reused the existing parsing apparatus extensively. During the initial phase of indexing, both the file handle and the XML content are handled by the new parser class, which uses xml.parsers.expat to return an indexed XML element tree. During the actual loading of the record, the relevant file parts are read and the XML strings are passed to ElementTree for XML parsing. Finally, the elements are passed to the existing Uniprot-XML parsing apparatus to fetch the information. For the next week I’ll be adding the last of the code necessary to expose lazy loading to the public API and I’ll be documenting the features. I'll also be doing some benchmarks before th…
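The two-phase approach described above (index with expat, then parse only the needed part with ElementTree) can be sketched roughly like this. It is a simplified illustration, not the parser class from my branch: the function names `index_entries` and `load_entry` are hypothetical, the sample document is made up and omits the Uniprot namespace, and the end offset assumes the closing tag is written exactly as `</entry>` with no extra whitespace.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat


def index_entries(data, tag="entry"):
    """Return a list of (start, end) byte offsets, one per <tag> element."""
    spans = []
    open_starts = []  # stack of start offsets for currently open elements
    closing_len = len("</%s>" % tag)
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        if name == tag:
            # CurrentByteIndex points at the "<" of the start tag.
            open_starts.append(parser.CurrentByteIndex)

    def on_end(name):
        if name == tag:
            # Here CurrentByteIndex points at the "<" of the end tag.
            end = parser.CurrentByteIndex + closing_len
            spans.append((open_starts.pop(), end))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return spans


def load_entry(data, span):
    """Parse just one indexed element into an ElementTree element."""
    start, end = span
    return ET.fromstring(data[start:end])


# Made-up sample data standing in for a large Uniprot-XML file.
SAMPLE_XML = (
    b"<uniprot>"
    b"<entry><accession>P00001</accession><sequence>MKV</sequence></entry>"
    b"<entry><accession>P00002</accession><sequence>GAT</sequence></entry>"
    b"</uniprot>"
)

sample_spans = index_entries(SAMPLE_XML)
second_entry = load_entry(SAMPLE_XML, sample_spans[1])
```

With a real file, the slice would be read by seeking to the stored offsets rather than slicing an in-memory bytes object, and the resulting element would then be handed to the existing Uniprot-XML parsing code.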