[BioPython] Bio.GenBank FeatureParser vs RecordParser
Peter
biopython at maubp.freeserve.co.uk
Sat Sep 16 23:31:50 UTC 2006
I've been looking at some timings for parsing GenBank files, in
particular FeatureParser vs RecordParser.
The test file I'm using is one of the largest bacterial genomes; the
GenBank file is almost 24MB:
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Streptomyces_coelicolor/NC_003888.gbk
On my nice new desktop:
RecordParser takes about 5s to return a Bio.GenBank.Record object.
FeatureParser takes about 45 to 50s to return a SeqRecord object.
Feature location parsing (and setting up associated sub feature objects)
takes about 90% of that time. By commenting this out (*), the
FeatureParser is actually faster than RecordParser.
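
For anyone wanting to repeat the comparison, here is a rough sketch of how such timings might be taken. The helper names time_call and time_genbank_parsers are made up for illustration, and this is not necessarily how the numbers above were measured. It assumes Biopython is installed and that the two parser classes each expose a parse(handle) method.

```python
import time


def time_call(func, *args):
    """Run func(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def time_genbank_parsers(path):
    """Time both Bio.GenBank parsers on one file (hypothetical helper)."""
    # Imported here so time_call stays usable without Biopython.
    from Bio import GenBank

    parsers = (
        ("RecordParser", GenBank.RecordParser()),
        ("FeatureParser", GenBank.FeatureParser()),
    )
    timings = {}
    for name, parser in parsers:
        # Reopen the file for each parser so both start from the top.
        with open(path) as handle:
            _, timings[name] = time_call(parser.parse, handle)
    return timings
```

Calling time_genbank_parsers("NC_003888.gbk") on a local copy of the file should give a dictionary of seconds per parser. A single run is noisy, so repeating it a few times is sensible.
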
I personally am only very rarely interested in the location objects,
and for some things I actually prefer the raw location string.
For the long-term unification of BioPython's sequence input (something
being discussed on the development list), it has been proposed that all
sequence parsers should return SeqRecords, so we should do something
about the slowness of the GenBank feature parser.
Right now I'd suggest a boolean option controlling whether the location
is parsed and turned into a nice object-oriented representation, or
simply held as a raw string.
Do people think this is a good idea or not?
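
To make the proposal concrete, here is a minimal sketch of what such a switch could look like. Everything in it is hypothetical: FeatureSketch, parse_simple_location and the parse_location flag are illustrative names rather than existing Bio.GenBank API, and the toy parser only handles plain "start..end" ranges with an optional complement(...) wrapper.

```python
import re

# Matches only the simplest GenBank location form, e.g. "100..200".
_SIMPLE_RANGE = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_simple_location(text):
    """Turn 'a..b' or 'complement(a..b)' into (start, end, strand)."""
    strand = 1
    if text.startswith("complement(") and text.endswith(")"):
        strand = -1
        text = text[len("complement("):-1]
    match = _SIMPLE_RANGE.match(text)
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    # Coordinates are kept exactly as written in the file.
    return int(match.group(1)), int(match.group(2)), strand


class FeatureSketch:
    """Hypothetical feature holder with optional location parsing."""

    def __init__(self, key, location_string, parse_location=True):
        self.key = key
        # The raw string is always kept, whatever the flag says.
        self.location_string = location_string
        # Parsing is skipped entirely when the flag is False.
        self.location = (
            parse_simple_location(location_string) if parse_location else None
        )
```

With the flag off, a feature such as FeatureSketch("CDS", "join(1..5,7..9)", parse_location=False) costs nothing beyond storing the string, which is the behaviour the option is meant to provide.
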
The other option (which I do plan to look into) is improving the
location parser so that it doesn't cause such a slowdown.
Peter
(*) = Just make the location function do an immediate return in class
_FeatureConsumer in file Bio/GenBank/__init__.py