Indexing XML files in Python
XML poses some unique challenges for file indexing. Up to now, the sequence file formats I have worked with have been line-based, so newline characters carry concrete meaning. This is useful because an easy way to iterate over a text file is through repeated readline() calls, or by using the handle in a for loop, which produces a line iterator. XML is a tag-based format where structure is defined by tags rather than by whitespace and newlines. This means that I could hypothetically have a complete and valid XML file whose entire contents are loaded by a single readline() call. Conversely, a single chunk of valid data may be split across many newlines while remaining XML compliant.
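A small sketch of this point, using made-up data: the two byte strings below describe the same two entries, yet one is a single line and the other is spread over eight. The names one_line, many_lines, count_lines and entry_ids are only illustrative.

```python
import io
import xml.etree.ElementTree as ET

# Two made-up documents with identical XML content but different line layout.
one_line = b'<entries><entry id="a">first</entry><entry id="b">second</entry></entries>'
many_lines = (
    b'<entries>\n<entry\n  id="a"\n>first</entry>\n'
    b'<entry\n  id="b"\n>second</entry>\n</entries>\n'
)


def count_lines(data):
    """Count the lines a line iterator would produce for these bytes."""
    return sum(1 for _ in io.BytesIO(data))


def entry_ids(data):
    """Return the id attribute of every <entry> element."""
    return [entry.get("id") for entry in ET.fromstring(data).iter("entry")]


print(count_lines(one_line), entry_ids(one_line))
print(count_lines(many_lines), entry_ids(many_lines))
```

Both documents yield the same entries, so line boundaries say nothing reliable about where a record starts or ends.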
Working with large XML files in Python is fairly difficult because the default handlers, ElementTree and minidom, are oriented toward complete files: invoking the parser automatically reads the entire file and generates a single contiguous data structure. Furthermore, these handlers strip all file-based context from the objects they produce, so it is difficult to tie an object produced by these parsers back to the original XML text.
Stream-based parsers are an attractive option for indexing since, by their nature, they iterate over the file piece by piece. This piecewise iteration could potentially give access to file offsets and help avoid parsing unneeded text. They are also more difficult to use, since they produce unstructured data iteratively as they encounter the XML components. Python includes several stream-based parsers. Python's xml.sax module provides an IncrementalParser that reports XML elements as it encounters them, but its Locator only provides a line number and column, not an absolute file position. Python's xml.etree also has a stream-based parser. I looked into this briefly, but I could only access the file location by overriding the file read calls to take byte-wise chunks. This resulted in massive slowdowns, due both to redundant read calls and to unnecessary checks against the XML consumer. Had the etree parser worked, I still would have needed to develop a way of handling the stream output. The stream parser that I finally ended up using was xml.parsers.expat. Expat is a fast C-based parser with bindings in many programming languages. I eventually chose expat because it can report the current file position, and because it accepts a handle that has been set to a specific position without raising any errors or altering the position prior to parsing.
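A minimal sketch of the two expat features mentioned above, not the actual indexing code. It relies on the parser's CurrentByteIndex attribute and on ParseFile reading from wherever the handle currently sits. The function name start_tag_offsets and the sample data are made up for illustration.

```python
import io
import xml.parsers.expat


def start_tag_offsets(handle):
    """Return (tag name, absolute byte offset) for every start tag."""
    # CurrentByteIndex counts bytes fed to this parser, so remember where
    # the handle was positioned when parsing began and add it back on.
    base = handle.tell()
    parser = xml.parsers.expat.ParserCreate()
    found = []

    def start_element(name, attrs):
        # Inside a callback, CurrentByteIndex is the offset of the first
        # character of the event, here the "<" of the start tag.
        found.append((name, base + parser.CurrentByteIndex))

    parser.StartElementHandler = start_element
    parser.ParseFile(handle)
    return found


# Made-up sample data; the handle is positioned past the XML declaration.
data = (
    b'<?xml version="1.0"?>\n'
    b'<entries><entry id="a">first</entry><entry id="b">second</entry></entries>'
)
handle = io.BytesIO(data)
handle.seek(data.index(b"<entries>"))
found = start_tag_offsets(handle)

for name, offset in found:
    # Each offset points at the matching start tag in the original bytes.
    print(name, offset, data[offset:offset + len(name) + 1])
```

Because the offsets are absolute, they can be stored in an index and later passed to seek() on the same file.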
I have implemented a simple parser on top of expat that builds a tree containing the tag offsets, tag text, attributes, and a method to extract each tag's raw text from the handle. This is now included in the _lazy module and will be used for indexing UniProt-XML in Biopython. A Biopython-decoupled version is now also posted on my GitHub.
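The following is a rough sketch of what such a tree could look like. It is not the code from the _lazy module. TagNode, build_index and the sample records are hypothetical names and data chosen for illustration. To stay short, the sketch assumes every element is closed by an explicit end tag (</name>); self-closing tags such as <name/> would need extra handling and are not supported here.

```python
import io
import xml.parsers.expat


class TagNode:
    """One indexed element: name, attributes, text, children, offsets."""

    def __init__(self, name, attrs, start):
        self.name = name
        self.attrs = attrs
        self.text = ""
        self.children = []
        self.start = start          # offset of the "<" of the start tag
        self.end_tag_start = None   # offset of the "<" of the end tag

    def raw(self, handle):
        """Read this element's original bytes back out of the handle."""
        # Assumption: the element ends with an explicit "</name>" tag,
        # so the first ">" after end_tag_start closes the element.
        handle.seek(self.end_tag_start)
        closing = b""
        while not closing.endswith(b">"):
            byte = handle.read(1)
            if not byte:
                raise ValueError("end tag is not terminated")
            closing += byte
        handle.seek(self.start)
        return handle.read(self.end_tag_start - self.start) + closing


def build_index(handle):
    """Parse from the handle's current position; return the root TagNode."""
    base = handle.tell()
    parser = xml.parsers.expat.ParserCreate()
    stack, roots = [], []

    def start_element(name, attrs):
        node = TagNode(name, attrs, base + parser.CurrentByteIndex)
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)

    def end_element(name):
        stack.pop().end_tag_start = base + parser.CurrentByteIndex

    def character_data(data):
        if stack:
            stack[-1].text += data

    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = character_data
    parser.ParseFile(handle)
    return roots[0]


# Made-up records, only meant to exercise the sketch.
xml_bytes = (
    b'<records>\n'
    b'  <entry dataset="one"><name>EXAMPLE_1</name></entry>\n'
    b'  <entry dataset="two"><name>EXAMPLE_2</name></entry>\n'
    b'</records>\n'
)
handle = io.BytesIO(xml_bytes)
root = build_index(handle)
second = root.children[1]
print(second.attrs, second.children[0].text, second.raw(handle))
```

Only the small tree of offsets and attributes stays in memory; the raw text of an entry is read back from the handle when it is requested.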