Posts

Fasta performance comparison

I wanted to add performance metrics for fasta format files since their simplicity means they are often used to transmit large sequences, which is the main intended use case of the lazy loading parsers. Fasta files don't contain extensive annotations or any features, so the primary task of parsing a fasta file is parsing the sequence. Included below are timing comparisons for both a large file and a medium-sized file. Here we can see that the lazy loader's sequence reading is about 20% slower when reading the full range. As with GenBank files, lazy loading is much better when using slices to access only a portion of a record: reading five percent of the sequence results in the expected 95% time savings when using the lazy loading parser. The results posted here and those in my previous blog post were made using Python 3.4.
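To illustrate the access pattern being timed, here is a minimal sketch. The file names and record key are placeholders, and the lazy=True keyword is an assumption for illustration; the exact way the branch selects the lazy parser may differ.

    from Bio import SeqIO

    # Hypothetical usage: 'lazy=True' is an assumed flag, and
    # "large.fasta" / "chr1" are placeholder names.
    records = SeqIO.index_db("seqs.idx", "large.fasta", "fasta", lazy=True)

    record = records["chr1"]
    # Slicing should touch only ~5% of the sequence on disk,
    # which is where the ~95% time saving comes from.
    first_5_percent = record.seq[: len(record) // 20]
    print(len(first_5_percent))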

Pre-pull-request performance tests and documentation

This week I am getting ready to make a pull request for my lazy loading features. While I will continue to commit small fixes and features, I will not be developing at the fast pace of GSoC from this point forward. For those interested in trying my lazy loading branch, it can be found on my GitHub. While the lazy loading classes and source files are documented, a good place to start learning the relatively simple usage pattern is the tutorial file; I have temporarily posted just the sections written for lazy loading at this Dropbox public link. I have done a performance evaluation by using SeqIO.index_db with and without the lazy loading parser on several large data files. The data was produced using my recently updated performance testing script, available in this gist. My results are summarized in the graphs below. Each column should be taken as an atomic operation, not a step in a cumulative process. Thus the column "read seq
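A minimal sketch of the kind of with/without comparison described above. The file names are placeholders and, as before, the lazy=True keyword is an assumed flag rather than confirmed API; each timing is a self-contained operation, matching the per-column reading of the graphs.

    import time
    from Bio import SeqIO

    def time_full_read(index_file, data_file, fmt, **kwargs):
        """Time indexing plus reading every full sequence once."""
        start = time.time()
        db = SeqIO.index_db(index_file, data_file, fmt, **kwargs)
        for key in db:
            _ = str(db[key].seq)  # force the sequence to be materialized
        return time.time() - start

    # 'lazy=True' is an assumed branch-specific flag, not released API.
    standard = time_full_read("std.idx", "large.gb", "genbank")
    lazy = time_full_read("lazy.idx", "large.gb", "genbank", lazy=True)
    print("standard: %.1fs, lazy: %.1fs" % (standard, lazy))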

Preparing the lazy loading parser for public review

This last week has been very busy for me. My recent goal has been to tie up loose ends and finish important features that make integration with Biopython more complete and transparent. Largely this was inspired by a desire to start documenting my features. Larry Wall once said that laziness is a virtue of great programmers, and I hope to embody that virtue by making my parser so similar to the existing index_db parser that I can extensively lean on existing documentation. I'm sure that the other lazy programmers out there will be happy to avoid reading a small novel about my new features. So far my activities have included:

- Writing a file-handle cache that limits the number of open files (sketched below)
- Writing a file wrapper that exposes the cache to the lazy class
- Writing a scheme to use multiple files (as required by index_db)
- Integrating the lazy parsers with SeqIO's read, parse, and index_db
- Adding support for key modifying functions
- Reorganizing and extending the tests
- Adding features t
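The post does not show the cache implementation, but a file-handle cache that caps the number of open files is typically a small LRU structure. The sketch below is my own illustration of the idea, not the branch's code; names are invented.

    from collections import OrderedDict

    class FileHandleCache:
        """Keep at most max_open handles, closing the least
        recently used handle when the limit is exceeded."""

        def __init__(self, max_open=10):
            self.max_open = max_open
            self._handles = OrderedDict()  # filename -> open handle

        def get(self, filename):
            if filename in self._handles:
                # Mark as most recently used.
                self._handles.move_to_end(filename)
            else:
                if len(self._handles) >= self.max_open:
                    # Evict and close the least recently used handle.
                    _, stale = self._handles.popitem(last=False)
                    stale.close()
                self._handles[filename] = open(filename, "rb")
            return self._handles[filename]

A lazy record holding only byte offsets can then ask the cache for its source handle on every read, so thousands of indexed files never exceed the operating system's open-file limit.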

Lazy loading of Uniprot-XML

We are closing in on the pencils down date for Google Summer of Code, but I'm still busy cleaning up my code and adding the last of my features. Previously I developed a method of indexing XML files using Python's expat bindings. This helped tremendously with the task of indexing and lazy-loading Uniprot-XML files. As always, I reused the existing parsing apparatus extensively. During the initial phase of indexing, both the file handle and the XML content are handled by the new parser class, which uses xml.parsers.expat to return an indexed XML element tree. During the actual loading of the record, the relevant file parts are read and the XML strings are passed to ElementTree for XML parsing; finally, the elements are passed to the existing Uniprot-XML parsing apparatus to fetch the information. For the next week I'll be adding the last of the code necessary to expose lazy loading to the public API and I'll be documenting the features. I'll also be doing some benchmarks befo
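To give a feel for the expat-based indexing step, here is a minimal sketch; the tag name and the flat offset list are simplifications of my own and do not reflect the branch's actual index format.

    import xml.parsers.expat

    def index_entries(filename, tag="entry"):
        """Record (start, end) byte offsets of each <entry> element."""
        offsets, starts = [], []
        parser = xml.parsers.expat.ParserCreate()

        def start_element(name, attrs):
            if name == tag:
                starts.append(parser.CurrentByteIndex)

        def end_element(name):
            if name == tag:
                # CurrentByteIndex points at the start of the closing
                # tag; a real index would add the tag's length too.
                offsets.append((starts.pop(), parser.CurrentByteIndex))

        parser.StartElementHandler = start_element
        parser.EndElementHandler = end_element
        with open(filename, "rb") as handle:
            parser.ParseFile(handle)
        return offsets

At load time the reader can then seek to an entry's start offset, read just that slice, and hand the string to ElementTree for parsing, matching the two-phase flow described above.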

Indexing XML files in Python

XML provides some unique challenges for file indexing. Up to now, the sequence file formats I have worked with used line-based data where newline characters have concrete meaning. This is useful since an easy way to iterate over a text file is through the readline() call, or by iterating over the file in a for loop, which produces a line iterator. XML is a tag-based format where structure is conferred by tags rather than by whitespace and newlines. This means that I could hypothetically have a complete and valid XML file whose entire contents are loaded by a single readline() call. Conversely, it is possible for a single chunk of valid data to be separated by many newlines while remaining XML compliant. Looking at large XML files in Python is fairly difficult since the standard handlers, ElementTree and minidom, are oriented toward complete files: invoking the parser will automatically read the entire file to generate a single contiguous data structure. Furthermore, these handlers strip all
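For contrast, the line-oriented indexing that works for the other formats can be as simple as remembering byte offsets while reading lines; a minimal sketch, with fasta's ">" record marker as the illustrative case (it assumes each header line carries a non-empty id).

    def index_fasta(filename):
        """Map record ids to their starting byte offsets."""
        offsets = {}
        with open(filename, "rb") as handle:
            offset = handle.tell()
            line = handle.readline()
            while line:
                if line.startswith(b">"):
                    # First whitespace-delimited token is the record id.
                    record_id = line[1:].split(None, 1)[0].decode()
                    offsets[record_id] = offset
                offset = handle.tell()
                line = handle.readline()
        return offsets

Nothing like this is possible for XML, since record boundaries bear no fixed relationship to line boundaries; that is what pushes the indexing down to a real XML tokenizer.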

Parsing EMBL and performance testing

Last week I spent some time working with the EMBL format. After an initial working version using an EMBL-specific index builder, I was able to create a generalized index building method that covers both EMBL and GenBank. Currently the parser works in many common cases, with just a few edge cases to cover before all files included in the test suite will be indexed and read correctly. At the moment, all "well behaved" files (i.e. files with a header, features, and a sequence) will be parsed correctly. As an update to my performance evaluation, the elimination of the unnecessary file tell() calls has made my indexing routine significantly more time-competitive with the traditional SeqIO parsers. I have made a performance testing script for anybody who wants to try the new lazy-loading parser to see how it performs with their data or with a reference data set. Pulling from my lazy-load feature branch today will give support for fasta, GenBank, and EMBL formats. The performan
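The generalized builder itself is not shown in the post, but the core idea of covering both formats is that each marks a record's first line with a known keyword (ID for EMBL, LOCUS for GenBank) and ends records with "//". A sketch of that shared scanning logic, written for illustration only; note it tracks the offset arithmetically rather than calling tell() per line, in the spirit of the optimization mentioned above.

    RECORD_START = {"embl": b"ID ", "genbank": b"LOCUS "}

    def index_insdc(filename, fmt):
        """Return (start, end) byte offsets for each record."""
        marker = RECORD_START[fmt]
        offsets = []
        start = None
        offset = 0
        with open(filename, "rb") as handle:
            for line in handle:
                if line.startswith(marker):
                    start = offset
                offset += len(line)  # no tell() needed in binary mode
                if line.startswith(b"//") and start is not None:
                    offsets.append((start, offset))
                    start = None
        return offsets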

Mid-week performance updates

As a small mid-week blog update, I wanted to put out my updated profiling data. Peter brought up the issue of redundant file-read passes and we discussed various ways to fix it. Now I've pushed my updated lazy parser code, and included below are the profiling results from a test identical to the one posted last week. Previously, tell() was called 14 million times and readline() 9.5 million times; now tell() calls have been reduced by over 90% to only 1.25 million, and readline() calls were cut in half.

    Ordered by: internal time

     ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
          1    19.161   19.161   74.577   74.577  InsdcIO.py:1279(_make_record_in...
    4969941    18.411    0.000   18.411    0.000  {method 'readline' of 'file' ob...
    4559813    11.647    0.000   14.985    0.000  re.py:226(_compile)
    4559812     9.643    0.000   33.604    0.000  re.py:134(match)
    4596151     9.304    0.000    9.304    0.000  {method 'match' of '...
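For anyone wanting to reproduce this kind of table, the numbers come from Python's standard profiler; a minimal sketch, with the actual benchmark replaced by a placeholder function.

    import cProfile
    import pstats

    def parse_everything():
        # Placeholder for the real benchmark: index a large GenBank
        # file and read every record through the lazy parser.
        pass

    cProfile.run("parse_everything()", "profile.out")
    stats = pstats.Stats("profile.out")
    # 'tottime' prints the "Ordered by: internal time" listing above.
    stats.sort_stats("tottime").print_stats(5)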