Posts

Showing posts from July, 2014

Indexing XML files in Python

XML provides some unique challenges for file indexing. Until now, the sequence file formats I have worked with used line-based data where newline characters have concrete meaning. This is convenient, since an easy way to iterate over a text file is through the readline() call, or by looping over the file object in a for statement, which produces a line iterator. XML is a tag-based format where structure is conveyed by tags rather than by whitespace and newlines. This means that I could hypothetically have a complete and valid XML file whose entire contents are loaded by a single readline() call. Conversely, a single chunk of valid data could be separated by many newlines while remaining XML compliant. Looking at large XML files in Python is fairly difficult since the default handlers, ElementTree and minidom, are oriented toward complete files: invoking the parser will automatically read the entire file to generate a single contiguous data structure. Furthermore, these handlers strip all fil…
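The post's actual indexing code isn't shown in this excerpt, but the standard library does offer an incremental alternative to the whole-file parsers: ElementTree's iterparse fires events as elements complete, rather than building the entire tree first. A minimal sketch (the record tag name and function name are hypothetical, not code from the post):

```python
import xml.etree.ElementTree as ET

def stream_records(source, tag):
    """Yield each element named `tag` as it finishes parsing,
    without loading the whole document into memory."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # free memory used by the finished subtree
```

This keeps memory roughly proportional to one record rather than to the whole file, which is the property a lazy-loading parser needs.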

Parsing EMBL and performance testing

Last week I spent some time working with the EMBL format. After an initial working version using an EMBL-specific index builder, I was able to create a generalized index-building method that covers both EMBL and GenBank. The parser now works in many common cases, with just a few edge cases to cover before all files included in the test suite will be indexed and read correctly. At the moment, all "well behaved" files (i.e. files with a header, features, and a sequence) are parsed correctly. As an update to my performance evaluation, the elimination of the unnecessary file tell() calls has made my indexing routine significantly more time-competitive with the traditional SeqIO parsers. I have made a performance testing script for anybody who wants to try the new lazy-loading parser to see how it performs with their data or with a reference data set. Pulling from my lazy-load feature branch today will give support for the fasta, GenBank, and EMBL formats. The performance …
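The branch's performance script is not reproduced here, but the basic shape of such a comparison can be sketched with a minimal wall-clock harness; the callable being timed would be, for example, a full SeqIO.parse pass versus a lazy-load pass over the same file (the harness below is a generic sketch, not the script itself):

```python
import time

def time_call(func, repeats=3):
    """Return the best wall-clock time over `repeats` runs of `func`.

    Taking the minimum of several runs reduces the influence of
    background noise on the measurement.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)
```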

Mid-week performance updates

As a small mid-week blog update, I wanted to put out my updated profiling data. Peter raised the issue of redundant file-read passes and we discussed various ways to fix it. I've now pushed my updated lazy parser code; included below are the profiling results from a test identical to the one posted last week. Previously tell() was called 14 million times and readline() 9.5 million times; now tell() calls have been reduced by over 90% to only 1.25 million, and readline() calls have been cut in half.

   Ordered by: internal time
    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1   19.161   19.161   74.577   74.577  InsdcIO.py:1279(_make_record_in...
   4969941   18.411    0.000   18.411    0.000  {method 'readline' of 'file' ob...
   4559813   11.647    0.000   14.985    0.000  re.py:226(_compile)
   4559812    9.643    0.000   33.604    0.000  re.py:134(match)
   4596151    9.304    0.000    9.304    0.000  {method 'match' of '_sr…
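For anyone wanting to reproduce this kind of listing on their own code, the standard library's cProfile and pstats modules produce it directly; sorting by 'tottime' gives the "Ordered by: internal time" view shown above. The function profiled here is just a stand-in:

```python
import cProfile
import io
import pstats

def profile_internal_time(func, lines=5):
    """Run `func` under cProfile and return its stats as text,
    ordered by internal time (tottime)."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(lines)
    return stream.getvalue()
```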

Finishing GenBank

This week I finished up my GenBank SeqRecordProxyBase-derived class and tested it against a variety of cases. The class is quite a bit more stable now and is hardened against a number of cases that previously raised errors. An ongoing task is to profile the code to find the low-hanging fruit for optimization. A sufficiently long test for the indexer is the GenBank record for human chromosome 1. I decided to test the creation of an index, since this is currently the slowest operation. Included below are the top results from cProfile when sorted by internal time.

   Ordered by: internal time
     ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   14109609   86.100    0.000   86.100    0.000  {method 'tell' of ...
          1   49.685   49.685  234.552  234.552  InsdcIO.py:1279(_make_record_index)
   22834575   47.643    0.000   47.643    0.000  {method 'match' of '_sre...
   13678883   41.716    0.000   41.719    0.000  re.py:273(_compile)
   1367888…

Parsing GenBank features

For the last week I've been working on indexing and lazy loading of GenBank sequence features. Since this is the first feature-annotated format I've converted to lazy loading, I am also extending the lazy abstract class code to help manage the feature index. The strategy for building the feature index that I found to work, and to be moderately straightforward, required a somewhat deep rewrite of the parse_features method in the GenBank scanner class. Rather than modify the scanner class itself, I've used its code as the base for my indexing method and I am now bypassing this step inside the scanner entirely. The code has to be modified directly because file position checking is needed after each potentially relevant read operation. When certain criteria are met, the offset for a specific feature is saved. Rather than manually parse each feature's positional information, I used the captive consumer object to fully parse each feature and then used the 'location.nofuzzy_start' and 'l…
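The rewritten scanner code is not shown in this excerpt, but the core idea, checking the file position after each read and saving an offset when a line looks like the start of a feature, can be sketched roughly as follows. The indentation-based test for a feature key (a non-blank character in column 6 of the feature table) is a simplification of the real GenBank rules, and the function name is hypothetical:

```python
def index_feature_offsets(handle):
    """Return the file offsets of lines that start a new feature in a
    GenBank-style feature table (key at column 6, e.g. '     gene ...')."""
    offsets = []
    offset = handle.tell()          # position *before* each read
    line = handle.readline()
    while line:
        if len(line) > 5 and not line[:5].strip() and line[5] != " ":
            offsets.append(offset)  # this line begins a feature
        offset = handle.tell()
        line = handle.readline()
    return offsets
```

Note that tell() is called once per line here; the profiling work described in the other posts was precisely about keeping such position checks to a minimum.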

Lazy loading of GenBank format

For the past week and a half I have been working on a lazy-loading parser for the GenBank format. GenBank itself is one of the largest and most widely used databases of genetic sequences. High-quality support of its format is important for Biopython, and support of this format is essential for my lazy-loading/indexing feature to gain adoption. Between the Fasta and GenBank formats I will already cover a large segment of sequence I/O tasks. Unlike the fasta format, which is notable for its simplicity and liberal interpretation, the GenBank format is strict and much more verbose. This strict specification and verbosity allow highly detailed annotations to be carried by a GenBank file. Information such as coding regions, complement features, variants, and much more can all be assigned to specific regions of the transmitted sequence. This wealth of information is useful to researchers but makes the act of parsing these files more difficult. The architecture of the current parser …

Back to coding: storage of the _index dictionary

After a period of relatively low activity due to my obligations in graduate school, and more recently due to my presentation at the ASMS conference, I am now coding full time for Biopython! In my previous blog update I mentioned the possibility of saving index information in an XML file in order to load a file's index quickly rather than having to parse the file every time. After a conversation with Peter and Bow we decided not to use XML and instead to use a SQLite database, similar to SeqIO.index_db. In the intended use case of SQLite storage for the lazy-loading and indexing parser, the database file name would be passed to the 'lazy' parameter rather than a simple Boolean True. Given the database file name, the lazy parser will first connect and verify that the database contains indices for a file. If indices are found, each record will be returned from the database rather than from direct reads of the file. If indices are not found, the file will be indexed fully and written to the datab…
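The described flow (connect, look for existing indices, otherwise build and store them) might look roughly like the sketch below. The table layout, function names, and build_index callback are all hypothetical illustrations, not the actual implementation, which follows SeqIO.index_db conventions:

```python
import sqlite3

def open_index(db_name, filename, build_index):
    """Return a {record_id: offset} index for `filename`, reading it
    from the SQLite database when present and building + storing it
    otherwise. `build_index` is a callback that performs a full parse."""
    con = sqlite3.connect(db_name)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(filename TEXT, record_id TEXT, offset INTEGER)"
        )
        rows = con.execute(
            "SELECT record_id, offset FROM offsets WHERE filename = ?",
            (filename,),
        ).fetchall()
        if rows:
            return dict(rows)          # indices found: skip the file scan
        index = build_index(filename)  # fall back to a full indexing pass
        con.executemany(
            "INSERT INTO offsets VALUES (?, ?, ?)",
            [(filename, rid, off) for rid, off in index.items()],
        )
        con.commit()
        return index
    finally:
        con.close()
```

On a second run with the same database file, the expensive build step is skipped entirely, which is the point of persisting the index.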

Working with the Fasta format

Fasta is one of the oldest sequence formats in common use and easily one of the most ubiquitous. The format dates to the FASTA alignment tools originally published in 1985, and its simplicity is likely the reason for its continued prevalence. A Fasta file can contain one or several sequences, where each sequence is initiated by a title line beginning with a '>'. Shown below is an example of the contents of a Fasta file with two sequences.

>sp|O15205|UBD_HUMAN Ubiquitin D OS=Homo sapiens GN=UBD PE=1
MAPNASCLCVHVRSEEWDLMTFDANPYDSVKKIKEHVRSKTKVPVQDQVLLLGSKILKPR
RSLSSYGIDKEKTIHLTLKVVKPSDEELPLFLVESGDEAKRHLLQVRRSSSVAQVKAMIE
RSLSSYGIDKEKTIHLTLKVVKPSDEELPLFLVESGDEAKRHLLQVRRSSSVAQVKAMIE
TKTGIIPETQIVTCNGKRLEDGKMMADYGIRKGNLLFLACYCIGG
>sp|P15516|HIS3_HUMAN Histatin-3 OS=Homo sapiens GN=HTN3 PE=1
MKFFVFALILALMLSMTGADSHAKRHHGYKRKFHEKHHSHRGYRSNYLYDN

Currently I am working on a lazy-loading parser for Fasta. Going off my previous work with the lazy-lo…
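A lazy-loading Fasta parser leans on exactly this structure: because every record starts at a '>' title line, a single pass can map identifiers to file offsets, and sequence data only needs to be read when a record is actually requested. A minimal sketch of such an index (not the branch's code; the identifier is taken as the first word of the title line):

```python
def index_fasta(handle):
    """Map each record's identifier to the file offset of its '>' line."""
    index = {}
    offset = handle.tell()      # position before reading each line
    line = handle.readline()
    while line:
        if line.startswith(">"):
            identifier = line[1:].split(None, 1)[0]
            index[identifier] = offset
        offset = handle.tell()
        line = handle.readline()
    return index
```

Given such an index, loading one record is a seek() to its offset followed by reading lines until the next '>' or end of file.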

Designing an abstract base for lazy loading proxies

My previous experience with programming has so far been self-directed, and I can generally make design decisions using relatively simple approaches. Making complex and extendable software requires deeper consideration of the underlying design. I found a few design resources on the internet, specific to Python, that were helpful in educating myself on common design patterns. Alex Martelli has an excellent presentation on design patterns in Python, which gives a nice overview of numerous patterns implemented in Python as well as a discussion of the issues encountered when translating design patterns made for other languages. Sakis Kasampalis' GitHub repository python-patterns is also an excellent resource for exploring design options, albeit with a focus on showing by example rather than discussing the best use cases. Wikipedia is also a decent resource when researching a specific design pattern, although some of the articles are a bit sparse. After reviewing the options, the init…

Building Biopython on Windows

Biopython is far from easy to build on Windows. While Linux environments can build programs relatively easily, Windows suffers from obtuse dependencies and hard-to-find developer tools. Because building on Windows is so difficult, the canonical way to get Biopython onto a Windows platform is through an installer built for each release. In order to test the newest Biopython builds, and my own branch, on the Windows platform, I need to prepare a build-ready environment on my Windows 8 boot partition. This is especially important for my development goals, which will have to compensate for the differences between Windows and Linux text files.
Building Biopython on Windows is most easily done with Microsoft Visual Studio or Visual C++. Lacking $1000 for a full version of this software, I was pleased to find that the Express editions are free to use, albeit hard to find. Without the correct C compiler available, the build system will raise an error regarding vcvarsall.bat. Others on…

Creating a faster container data structure

The analysis of a biological sequence often requires parsing and processing richly annotated metadata. Biological sequence data for proteins includes attributes such as active sites, sequence variants, post-translational modifications, secondary structures, and domains. These annotations range in length from a single amino acid to hundreds. Genetic sequence data also carries a diverse range of annotations, including the coding DNA sequence, exons, repeats, single-nucleotide polymorphisms, and many others; these range in size from a single nucleic acid to millions of residues. Given a sequence [S1, S2) and an annotation covering [A1, A2), we can say that the sequence intersects the annotation in several cases: (1) the beginning of the annotation is bounded by the sequence region, or the entire annotation is bounded (both cases pass test 1 below); (2) the end of the annotated region is bounded by the sequence; (3) the sequence is entirely bounded by the annotated region. S1 ≤ A1 &a…
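Taken together, the three cases can be written as a single predicate over half-open intervals; this sketch mirrors the tests enumerated above (the function and argument names are illustrative):

```python
def intersects(s1, s2, a1, a2):
    """True if sequence region [s1, s2) overlaps annotation [a1, a2)."""
    start_bounded = s1 <= a1 < s2              # case 1: annotation start (or
                                               # whole annotation) inside sequence
    end_bounded = s1 < a2 <= s2                # case 2: annotation end inside sequence
    sequence_bounded = a1 <= s1 and s2 <= a2   # case 3: sequence inside annotation
    return start_bounded or end_bounded or sequence_bounded
```

For non-empty intervals the three tests collapse to the standard overlap check `s1 < a2 and a1 < s2`, but keeping them separate makes the correspondence with the cases above explicit.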

Testing Biopython

Biopython is distributed for Python 2.6 through 3.4 from a single codebase, and it is written with the intention of maintaining support for alternative Python implementations such as PyPy and Jython. Constant and careful testing is an important part of development for maintaining such broad compatibility. Several options exist for this task. The easiest but least flexible method is to use the system Python interpreters: Ubuntu 14.04 includes Python 2.7 and 3.4, and also ships pypy-2.2 in the default repositories. These interpreters can be used directly with the unit tests. A disadvantage of using the system interpreters directly is the eventual contamination caused by adding many different versions of various dependencies. To avoid this downside, more advanced approaches involving virtualization technology, such as Docker, are an option. In my case, I took a middle road involving much less work than full-scale virtualization but conferring some of the benefits. Virtualenv is a…

Setting up version control for Biopython

Having never been involved in a distributed development project before, I knew that just gaining the technical ability to contribute to Biopython would be an uphill battle. To contribute, I had to prepare a development environment and get a handle on the git forking and merging workflow. Today I'd like my blog post to address my progress with git. My previous experience with git is pretty standard for a developer who has not contributed to other projects before: I have several projects hosted on GitHub and I use git pull and git push exclusively to manage them and sync them between computers. This has the side effect of keeping my version history linear. Contributing to Biopython requires the full power of git's branching, merging, and tracking features. Thankfully, the hard work and thorough documentation of others has made this task easier for beginners like myself. When it became clear to me that my application was competitive, I started preparing early by read…

How I wrote a successful Google Summer of Code application for Biopython

There are many resources speaking generally about the qualities of a successful GSoC proposal, and my first suggestion to others is to read those resources. The reason I have chosen to write this blog post is simply to give my own limited insights into the process. It is worth saying that every open source project will have its own priorities and requirements, so my observations, and those I've linked to, may not be generally applicable.

https://drupal.org/node/59037
http://blog.gerv.net/2006/05/how_not_to_apply_for_summer_of/
http://www.di.ens.fr/~baghdadi/TXT_blog/5_advices_to_get_your_proposal_accepted.lyx.html

With that out of the way, I can get to my own experience. I have used Biopython for about a year, after deciding that manually parsing UniProt data wasn't worth the trouble. When I heard that Biopython would be accepting a GSoC student, I immediately decided that I would field a competitive application. The project idea posted by the organizers regarding implementation o…

Applying for Google's Summer of Code program!

Edit 8/19/2014:
I was accepted to the GSoC program. I've written a blog post on the process of applying, with some general thoughts on how to write a successful application.

Originally submitted on 03/24/2014
I decided today to kill the old EvanAParker.com PHP page and resurrect something decent in its place. Gone are my days of manually editing HTML to update what amounts to a pseudo-resume and link-farm. Onward to the days of having the awesome power of a corporate-grade CMS to manage my pseudo-resume and link-farm.
I've taken some time recently to apply to Google's Summer of Code, an open-source internship matchmaking program. It funds a several-month period of coding under mentorship for an open source project. While many applicants spend the time to apply to several OSS organizations, I threw all my chips into the Open Bioinformatics Foundation's basket by writing a proposal for Biopython. If I'm selected, I'll work on upgraded file parsers for sequence annotatio…