[BioPython] Bio.GenBank FeatureParser vs RecordParser

Sat Sep 16 23:31:50 UTC 2006

I've been looking at some timings for parsing GenBank files, in 
particular FeatureParser vs RecordParser.

The test file I'm using is one of the largest bacterial genomes; the 
GenBank file is almost 24MB:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Streptomyces_coelicolor/NC_003888.gbk

On my nice new desktop:

RecordParser takes about 5s to return a Bio.GenBank.Record object.

FeatureParser takes about 45 to 50s to return a SeqRecord object.
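
For anyone wanting to repeat the comparison, here is a minimal sketch of how such timings could be taken. The helper name time_call is hypothetical, and the Biopython usage shown in the comments assumes Bio.GenBank is installed and the file above has been downloaded locally:

```python
import time


def time_call(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Assumed usage (not run here; needs Biopython and the downloaded file):
#
#   from Bio import GenBank
#   with open("NC_003888.gbk") as handle:
#       record, secs = time_call(GenBank.RecordParser().parse, handle)
#   with open("NC_003888.gbk") as handle:
#       seq_record, secs = time_call(GenBank.FeatureParser().parse, handle)
```

A single run is only a rough guide; repeating each call a few times would smooth out disk caching effects.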

Feature location parsing (and setting up associated sub feature objects) 
takes about 90% of that time.  With this commented out (*), the 
FeatureParser is actually faster than RecordParser.

I personally am only very very rarely interested in the location 
objects, and indeed for some things actually prefer the raw location string.

For the long term unification of BioPython's sequence input (something 
being discussed on the development list) a move to standardising on all 
sequence parsers returning SeqRecords has been proposed - so we should 
do something about the slowness of the GenBank feature parser.

Right now I'd suggest a boolean option controlling whether the location 
should be parsed and turned into a nice object-oriented 
representation, or simply held as a raw string.

Do people think this is a good idea or not?
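
To make the suggestion concrete, here is a rough sketch of what such a switch might look like. This is not the existing Bio.GenBank code: SketchFeatureConsumer, SimpleLocation, parse_location and the parse_locations flag are all hypothetical names, and the toy parser only understands "start..end" and "complement(start..end)":

```python
class SimpleLocation:
    """Hypothetical minimal location object (coordinates kept as written)."""

    def __init__(self, start, end, strand):
        self.start = start
        self.end = end
        self.strand = strand


def parse_location(text):
    """Toy stand-in for the real (slow) location parsing.

    Handles only 'start..end' and 'complement(start..end)'.
    """
    strand = 1
    inner = text
    if text.startswith("complement(") and text.endswith(")"):
        strand = -1
        inner = text[len("complement("):-1]
    start, sep, end = inner.partition("..")
    if not (sep and start.isdigit() and end.isdigit()):
        raise ValueError("location too complex for this sketch: %r" % text)
    return SimpleLocation(int(start), int(end), strand)


class SketchFeatureConsumer:
    """Hypothetical consumer showing the proposed boolean option."""

    def __init__(self, parse_locations=True):
        self.parse_locations = parse_locations
        self.locations = []

    def location(self, content):
        if self.parse_locations:
            # Build the object representation (the expensive path).
            self.locations.append(parse_location(content))
        else:
            # Keep the raw location string untouched (the cheap path).
            self.locations.append(content)
```

With parse_locations=False each feature would simply carry the original string, e.g. "complement(10..20)", and anyone who needs the objects could parse that string later.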

The other option (which I do plan to look into) is improving the 
location parser so that it doesn't cause such a slowdown.

Peter

(*) = Just make the location function do an immediate return in class 
_FeatureConsumer in file Bio/GenBank/__init__.py