Working with the Fasta format

Fasta is one of the oldest sequence formats in common use and easily one of the most ubiquitous. The format dates to the FASTA alignment tools first published in 1985, and its simplicity is likely the reason for its continued prevalence. A Fasta file can contain one or more sequences, each introduced by a title line beginning with '>'. Shown below is the content of a Fasta file with two sequences.

>sp|O15205|UBD_HUMAN Ubiquitin D OS=Homo sapiens GN=UBD PE=1
MAPNASCLCVHVRSEEWDLMTFDANPYDSVKKIKEHVRSKTKVPVQDQVLLLGSKILKPR
RSLSSYGIDKEKTIHLTLKVVKPSDEELPLFLVESGDEAKRHLLQVRRSSSVAQVKAMIE
TKTGIIPETQIVTCNGKRLEDGKMMADYGIRKGNLLFLACYCIGG
>sp|P15516|HIS3_HUMAN Histatin-3 OS=Homo sapiens GN=HTN3 PE=1
MKFFVFALILALMLSMTGADSHAKRHHGYKRKFHEKHHSHRGYRSNYLYDN
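
To make the layout concrete, here is a minimal sketch of an eager (non-lazy) reader for this structure. The function name read_fasta is my own for illustration and is not part of any library. It assumes plain text input, and it strips surrounding whitespace so that stray spaces around a line do not break parsing.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) pairs from an open text handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()  # tolerate stray spaces and trailing newlines
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence lines may have any width
    if title is not None:
        yield title, "".join(chunks)


# Usage on an in-memory example with uneven line widths.
example = io.StringIO(">seq1 first record\nACGT\nAC\n>seq2\nTT\n")
parsed = list(read_fasta(example))
```

Each record is simply the title (without the leading '>') and the sequence lines joined together, which is all the format really requires.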

Currently I am working on a lazy-loading parser for Fasta. Building on my previous work with the lazy-loading base class, the primary logic implemented in the format-specific parser involves file IO. An interesting trend I've noticed is that much of the __init__ function, which I had assumed would be format specific, may end up being quite generic. By adding a generic __init__ to the lazy-loading base class, I will be able to rely on a common set of attributes being set for every format. Testing the lazy-loading parser is a substantial task, since so much of the actual sequence parsing had to be implemented independently to support on-demand access. Simple irregularities such as an extra character, an unanticipated space, or varying line widths all have to be accounted for.
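
As a rough illustration of the idea (not the actual parser I am writing), the sketch below records byte offsets in a first pass and only reads sequence data when it is requested. The names LazyFastaRecord and index_fasta are hypothetical, and the code assumes an ASCII file opened in binary mode so that offsets are exact.

```python
import os
import tempfile


class LazyFastaRecord:
    """Hypothetical record that reads its sequence only when asked."""

    def __init__(self, path, title, seq_start, seq_end):
        self._path = path
        self.title = title
        self._seq_start = seq_start  # byte offset of the first sequence line
        self._seq_end = seq_end      # byte offset just past the last one
        self._seq = None             # filled in on first access

    @property
    def seq(self):
        if self._seq is None:
            with open(self._path, "rb") as handle:
                handle.seek(self._seq_start)
                raw = handle.read(self._seq_end - self._seq_start)
            # split() with no arguments drops newlines and stray spaces,
            # so varying line widths need no special handling.
            self._seq = b"".join(raw.split()).decode("ascii")
        return self._seq


def index_fasta(path):
    """Scan the file once, recording where each record's sequence lives."""
    records = []
    title, seq_start, offset = None, None, 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.lstrip().startswith(b">"):
                if title is not None:
                    records.append(
                        LazyFastaRecord(path, title, seq_start, offset))
                title = line.strip()[1:].decode("ascii")
                seq_start = offset + len(line)
            offset += len(line)
    if title is not None:
        records.append(LazyFastaRecord(path, title, seq_start, offset))
    return records


# Usage: two records with uneven line widths and a stray space.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">a first\nACGT\n AC \n>b\nTT\n")
records = index_fasta(tmp.name)
titles = [r.title for r in records]  # no sequence has been read yet
first_seq = records[0].seq           # triggers a seek and a read
os.remove(tmp.name)
```

The indexing pass still touches every byte of the file, but it keeps only a few integers per record in memory, and the edge cases mentioned above are handled at read time rather than at index time.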

Another near-term goal is adding a mechanism to store and retrieve indices in an XML format. This was not in the original proposal, but in ordinary use cases it seems an essential feature for driving wider adoption of my subroutines. Currently, indices are built on demand, which adds a substantial performance penalty to opening large files. While pre-indexing is faster and less memory intensive than full parsing, it still requires reading the whole file. With an index output, users will be able to index a file once and reuse that index many times, with much faster instantiation on every subsequent read.
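
The XML layout has not been settled yet, so the following is only a sketch of what saving and reloading an index could look like using the standard library's xml.etree.ElementTree. The element and attribute names (index, record, title, start, end) are placeholders I made up for this example, and the offsets are invented.

```python
import xml.etree.ElementTree as ET


def index_to_xml(entries):
    """Serialize (title, seq_start, seq_end) tuples to an XML string."""
    root = ET.Element("index")
    for title, start, end in entries:
        ET.SubElement(root, "record",
                      title=title, start=str(start), end=str(end))
    return ET.tostring(root, encoding="unicode")


def index_from_xml(text):
    """Rebuild the list of (title, seq_start, seq_end) tuples."""
    root = ET.fromstring(text)
    return [(rec.get("title"), int(rec.get("start")), int(rec.get("end")))
            for rec in root.iter("record")]


# Usage: round-trip a small index (offsets here are illustrative only).
entries = [("sp|O15205|UBD_HUMAN Ubiquitin D", 62, 230),
           ("sp|P15516|HIS3_HUMAN Histatin-3", 293, 345)]
xml_text = index_to_xml(entries)
restored = index_from_xml(xml_text)
```

A real implementation would probably also store something like the source file's size or modification time, so that a stale index can be detected and rebuilt.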
