Preparing the lazy loading parser for public review

This past week has been very busy for me. My recent goal has been to tie up loose ends and implement the features that make integration with Biopython more complete and transparent. This was largely inspired by a desire to start documenting my work. Larry Wall once said that laziness is a virtue of great programmers, and I hope to embody that virtue by making my parser so similar to the existing index_db parser that I can lean extensively on existing documentation. I'm sure that the other lazy programmers out there will be happy to avoid reading a small novel about my new features.

So far my activities have included:

  • Writing a file-handle cache that limits the number of open files
  • Writing a file wrapper that exposes the cache to the lazy class
  • Writing a scheme to use multiple files (as required by index_db)
  • Integrating the lazy parsers with SeqIO's read, parse, and index_db
  • Adding support for key modifying functions
  • Reorganizing and extending the tests
  • Adding features to match the documented feature set for parse and index_db
  • Refactoring database interaction by creating a dedicated index manager class
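
To give a feel for the first item in the list, here is a minimal sketch of a file-handle cache that closes the least recently used handle once a limit is reached. It is not the code from my branch: the names FileHandleCache, max_open, and get are hypothetical and chosen only for illustration, and it assumes files are opened read-only in binary mode.

```python
from collections import OrderedDict


class FileHandleCache:
    """Keep at most `max_open` handles open, evicting the least recently used.

    Illustrative sketch only; class and method names are hypothetical.
    """

    def __init__(self, max_open=10):
        if max_open < 1:
            raise ValueError("max_open must be at least 1")
        self.max_open = max_open
        # Maps filename -> open handle, ordered from oldest to newest use.
        self._handles = OrderedDict()

    def get(self, filename):
        """Return an open handle for `filename`, reusing a cached one if possible."""
        handle = self._handles.get(filename)
        if handle is None or handle.closed:
            handle = open(filename, "rb")
            self._handles[filename] = handle
        # Mark this file as the most recently used.
        self._handles.move_to_end(filename)
        # Close the oldest handles until we are back under the limit.
        while len(self._handles) > self.max_open:
            _, oldest = self._handles.popitem(last=False)
            oldest.close()
        return handle

    def close(self):
        """Close every cached handle."""
        while self._handles:
            _, handle = self._handles.popitem()
            handle.close()
```

A caller would ask the cache for a handle each time it needs to seek into a file, rather than holding on to the handle, so that a handle evicted in the meantime is simply reopened on the next request.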

In addition to all of this work, I hit a milestone this week that I am still proud of: my first contribution to mainline Biopython. While writing the new test apparatus for the lazy loading class, I found a bug in SeqIO.index() related to selecting keys for EMBL format files. I implemented a fix, and Peter cherry-picked it into the mainline after we discussed it by email. This fix means that EMBL files will have more complete index and index_db support. It also means that I can use comparative testing to help validate my new indexer/parser.

I think the big tasks that remain are to rewrite outdated or incomplete docstrings, write a section in Tutorial.tex, and produce concise, shareable benchmarks.

