Finishing GenBank

This week I finished my GenBank SeqRecordProxyBase derived class and tested it against a variety of cases. The class is considerably more stable now and handles a number of cases that previously raised errors.

An ongoing task is to profile the code to identify the low-hanging fruit for optimization. A sufficiently large test case for the indexer is the GenBank record for human chromosome 1. I decided to profile the creation of an index since this is currently the slowest operation. Included below are the top results from cProfile when sorted by internal time.

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 14109609   86.100    0.000   86.100    0.000 {method 'tell' of ...
        1   49.685   49.685  234.552  234.552 InsdcIO.py:1279(_make_record_index)
 22834575   47.643    0.000   47.643    0.000 {method 'match' of '_sre...
 13678883   41.716    0.000   41.719    0.000 re.py:273(_compile)
 13678882   34.838    0.000  107.724    0.000 re.py:153(match)
        2   34.093   17.046  102.116   51.058 _lazy.py:813(sequential_...
  9529616   19.269    0.000   19.269    0.000 {method 'readline' of ...
  9119356   12.859    0.000   12.859    0.000 {method 'decode' of 'bytes' ...
  9119356   12.495    0.000   25.355    0.000 __init__.py:53()
        1    5.589    5.589   50.622   50.622 InsdcIO.py (_make_feature_index)
    20320    5.588    0.000   17.791    0.001 __init__.py (location)
  5103868    3.991    0.000    3.991    0.000 {method 'strip' of 'str' objects}
    20320    3.101    0.000    4.517    0.000 Scanner.py:211(parse_feature)
   195641    2.952    0.000    5.906    0.000 SeqFeature.py:583(__init__)
  1714721    2.342    0.000    2.342    0.000 {built-in method isinstance}
   190047    2.142    0.000    2.142    0.000 SeqFeature _get_location_operator
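
A listing like the one above can be collected with cProfile and pstats from the standard library. The following is only a minimal sketch: build_index is a hypothetical stand-in that walks a small in-memory file, not the actual indexing code, and profile_top is a helper name made up for this example.

```python
import cProfile
import io
import pstats


def build_index(n_lines=10000):
    """Hypothetical stand-in for the real index-building call."""
    handle = io.BytesIO(b"LOCUS       demo\n" * n_lines)
    offsets = []
    while True:
        offsets.append(handle.tell())
        if not handle.readline():
            break
    return offsets


def profile_top(func, limit=15):
    """Run func under cProfile and return the top rows by internal time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "tottime" sorts by time spent in the function itself, excluding
    # callees; pstats labels this ordering "internal time".
    stats.sort_stats("tottime").print_stats(limit)
    return out.getvalue()


report = profile_top(build_index)
print(report)
```

Sorting by internal time rather than cumulative time is what pushes cheap but very frequent calls such as tell() to the top of the list.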

The handle.tell() call turns out to be the most expensive part of the indexing operation: it accounts for 86.1 of the 234.6 seconds spent in _make_record_index. In contrast, readline(), which I expected to be expensive, ranks only seventh by internal time. Considering that the tested file has 4.55 million lines, 14.1 million calls to tell() (roughly three per line) is excessive, so removing tell() calls where possible should improve performance significantly.
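
One way to cut down on tell() calls is to track the file position by hand. This is a minimal sketch and not the indexer's actual code: it assumes the handle is opened in binary mode, so that the length of each line returned by readline() equals the number of bytes consumed. Both function names are invented for this illustration.

```python
import io


def line_offsets_with_tell(handle):
    """Baseline: ask the handle for its position before every line."""
    offsets = []
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break
        offsets.append(pos)
    return offsets


def line_offsets_counted(handle):
    """Track the position by summing line lengths; tell() is called once.

    Assumes a binary-mode handle, so len(line) is the exact number of
    bytes consumed by readline(), including the line terminator.
    """
    offsets = []
    pos = handle.tell()
    for line in iter(handle.readline, b""):
        offsets.append(pos)
        pos += len(line)
    return offsets


# Mixed line endings to show that byte counting still agrees with tell().
sample = b"LOCUS       demo\nFEATURES             Location/Qualifiers\r\nORIGIN\n//\n"
a = line_offsets_with_tell(io.BytesIO(sample))
b = line_offsets_counted(io.BytesIO(sample))
print(a == b)
```

The counted version would not be safe on a text-mode handle, where newline translation and decoding mean that the length of a line no longer matches the bytes read from disk.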

This week I will continue to optimize the GenBank parser, and I will also begin adding support for EMBL-formatted files.
