[Biopython-dev] Re-written GenBank/EMBL feature location parsing

Fri Jun 25 15:21:46 UTC 2010

Hi all,

I've been working on and off recently on rewriting the location
parsing for GenBank/EMBL features:
http://bugzilla.open-bio.org/show_bug.cgi?id=2738

I have a branch ready for public testing,
http://github.com/peterjc/biopython/commits/location-parsing2

The old code is still there (and indeed right now gets used as a
fallback, with a warning, if an unrecognised location is seen). I'd
like to label it (plus Bio.Parsers and Bio.Parsers.spark) as obsolete
for the next release, and then deprecate them in the subsequent release.

The old code takes each location string, parses it with SPARK and
generates a set of token objects for each element (see the code in
Bio.GenBank.LocationParser) and then turns that into SeqFeature
location and position objects. All this object creation is probably a
major reason why the old code is slow.

The new code takes each location string, and parses it with a mix
of regular expressions and simple Python code, and then builds
the SeqFeature location and position objects. In my tests this is
at least twice as fast, and typically three to four times as fast.
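
To give a flavour of the regular expression approach, here is a
minimal sketch. It is not the code from the branch: the patterns,
the parse_location name and the (start, end, strand) tuples are all
made up for illustration, and it only covers plain ranges plus
join(...) and complement(...). Fuzzy positions, single bases, the
between-base "^" form and remote references are deliberately left out.

```python
import re

# Hypothetical patterns for illustration only, not those used in the branch.
_SIMPLE = re.compile(r"^(\d+)\.\.(\d+)$")
_COMPLEMENT = re.compile(r"^complement\((.+)\)$")
_JOIN = re.compile(r"^join\((.+)\)$")


def parse_location(text, strand=1):
    """Return a list of (start, end, strand) tuples.

    Starts are converted to 0-based (Python style) counting, which is
    an assumption of this sketch. Parts are returned in the order
    they are written in the location string.
    """
    text = text.strip()
    match = _COMPLEMENT.match(text)
    if match:
        # Flip the strand and parse whatever is inside the brackets.
        return parse_location(match.group(1), -strand)
    match = _JOIN.match(text)
    if match:
        # Splitting on commas is enough here because this sketch does
        # not support anything containing a comma inside a join.
        parts = []
        for piece in match.group(1).split(","):
            parts.extend(parse_location(piece, strand))
        return parts
    match = _SIMPLE.match(text)
    if match:
        start, end = match.groups()
        return [(int(start) - 1, int(end), strand)]
    # Anything else is reported rather than guessed at.
    raise ValueError("Unrecognised location: %r" % text)
```

For example, parse_location("complement(join(1..5,10..20))") gives
two parts, both on the reverse strand. Raising an exception for
anything unrecognised is where a fallback to another parser could
be hooked in.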

The intention is that this parser change results in no functional
changes at all.

As part of this work I have been extending the feature unit tests,
and have also run some more extensive additional tests locally
(GenBank files for plants, viruses, environmental samples etc).
I'm reasonably sure this covers all the location variants... but
with GenBank and EMBL files you can never be sure ;)

Would anyone like to volunteer to test the new branch before
I merge it to the trunk? I'm also interested in comments on the
code itself. Note I have tried to avoid any refactoring until the
old code is actually deprecated.

Thanks,

Peter