Parsing EMBL and performance testing
Last week I spent some time working with the EMBL format. After an initial working version using an EMBL-specific index builder, I was able to create a generalized index-building method that covers both EMBL and GenBank. Currently the parser works in many common cases, with just a few edge cases to cover before all files included in the test suite are indexed and read correctly. At the moment, all "well-behaved" files (i.e. files with a header, features, and a sequence) are parsed correctly.
As an update to my performance evaluation, eliminating the unnecessary file tell() calls has made my indexing routine significantly more time-competitive with the traditional SeqIO parsers. I have made a performance testing script for anybody who wants to try the new lazy-loading parser and see how it performs with their own data or with a reference data set. Pulling from my lazy-load feature branch today will give support for the FASTA, GenBank, and EMBL formats. The performance test script can be found in a gist I have posted.
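The idea behind dropping tell() can be illustrated with a minimal sketch. This is not the code from the lazy-load branch; it only shows the general technique of keeping a running byte offset while scanning a file opened in binary mode, rather than asking the handle for its position on every line. The function name, the start_marker parameter, and the sample data are all hypothetical.

```python
import io


def index_record_offsets(handle, start_marker=b"LOCUS"):
    """Return the byte offset of every line that starts a record.

    The handle must be opened in binary mode so that len(line) is the
    exact number of bytes consumed from the file.
    """
    offsets = []
    position = 0  # running byte offset, kept by hand instead of handle.tell()
    for line in handle:
        if line.startswith(start_marker):
            offsets.append(position)
        position += len(line)
    return offsets


# Tiny in-memory stand-in for a two-record GenBank file (hypothetical content).
sample = io.BytesIO(
    b"LOCUS A\n"
    b"ORIGIN\n"
    b"//\n"
    b"LOCUS B\n"
    b"//\n"
)
offsets = index_record_offsets(sample)

# Each stored offset can later be used to jump straight to a record.
sample.seek(offsets[1])
second_record_first_line = sample.readline()
```

For an EMBL file the same sketch would be called with start_marker=b"ID", since EMBL records begin with an ID line.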
For the impatient, I'll quickly summarize some of the results, using GenBank's human chromosome 1 as an example, parsed on my laptop with an AMD E2-1800 processor. The worst-case scenario for the lazy-loading parser is indexing followed by a full read of the sequence and all features. This operation sums to 129.1 seconds, compared to 96 seconds to retrieve the same data using the traditional parser. Since indexing is only required once, another almost-worst-case scenario is opening an already indexed file and reading the full sequence and all features. This sums to 72 seconds, saving 24 seconds compared to the traditional parser. Lazy parsing really shines when only partial records are required. If I only need the first 5% of the information in my example data set, the lazy parser returns it in less than 4.5 seconds, roughly a 95% reduction in read time compared to the old parser.
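To make the arithmetic behind these summary figures explicit, here is a small Python sketch that recomputes them from the raw GenBank timings listed in this post. The dictionary keys and variable names are labels chosen for this illustration, not names used by the test script.

```python
# Raw GenBank timings from this post, in seconds.
genbank = {
    "seqio_parse": 96.353,
    "lazy_index": 57.085,
    "lazy_record_fetch": 0.152,
    "sequence_fetch": 47.910,
    "all_features_fetch": 24.002,
    "features_5pct_fetch": 1.937,
    "sequence_5pct_fetch": 2.397,
}

# Full read from an already indexed file: record + sequence + all features.
indexed_full_read = (
    genbank["lazy_record_fetch"]
    + genbank["sequence_fetch"]
    + genbank["all_features_fetch"]
)

# Worst case: build the index first, then do the full read.
worst_case = genbank["lazy_index"] + indexed_full_read

# Partial read: record + first 5% of features + first 5% of sequence.
partial_read = (
    genbank["lazy_record_fetch"]
    + genbank["features_5pct_fetch"]
    + genbank["sequence_5pct_fetch"]
)

# Comparisons against the traditional SeqIO.parse time.
saving_vs_parse = genbank["seqio_parse"] - indexed_full_read
partial_reduction = 1 - partial_read / genbank["seqio_parse"]

print(f"worst case:        {worst_case:.1f}s")
print(f"indexed full read: {indexed_full_read:.1f}s")
print(f"saving vs parse:   {saving_vs_parse:.1f}s")
print(f"partial read:      {partial_read:.2f}s")
print(f"partial reduction: {partial_reduction:.0%}")
```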
Below I have included the raw results for the GenBank file and for the EMBL file. Each performance test was run twice in a row before saving results, to account for any potential performance boost given by OS-level file caching.
#GenBank speed test results
SeqIO.parse time = 96.353s
Lazy record indexing time = 57.085s
Lazy record fetch time = 0.152s
Sequence fetch time = 47.910s
All feature fetch time = 24.002s
5% feature fetch time = 1.937s
5% sequence fetch time = 2.397s

#EMBL speed test results
SeqIO.parse time = 132.518s
Lazy record indexing time = 89.636s
Lazy record fetch time = 0.183s
Sequence fetch time = 40.735s
All feature fetch time = 52.198s
5% feature fetch time = 5.912s
5% sequence fetch time = 3.649s
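The run-twice approach mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than the contents of the test script, and it assumes that the second run is the one whose time is kept; time_warm and count_bytes are hypothetical helper names.

```python
import os
import tempfile
import time


def time_warm(func, *args, **kwargs):
    """Run func twice and return the elapsed seconds of the second run.

    The first call is only there so that the operating system has a chance
    to cache the file; its result and timing are discarded.
    """
    func(*args, **kwargs)
    start = time.perf_counter()
    func(*args, **kwargs)
    return time.perf_counter() - start


def count_bytes(path):
    """Stand-in workload: read a file line by line and count its bytes."""
    with open(path, "rb") as handle:
        return sum(len(line) for line in handle)


# Demonstrate on a small temporary file (hypothetical data).
with tempfile.NamedTemporaryFile("wb", suffix=".gb", delete=False) as tmp:
    tmp.write(b"LOCUS A\n//\n" * 1000)
    demo_path = tmp.name

elapsed = time_warm(count_bytes, demo_path)
demo_size = count_bytes(demo_path)
os.remove(demo_path)
print(f"warm read took {elapsed:.6f}s")
```

In a real comparison, the stand-in workload would be replaced by the parse or fetch operation being measured.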