[Biopython-dev] GenBank feature iterator
Peter
biopython-dev at maubp.freeserve.co.uk
Thu Jan 20 13:35:47 EST 2005
Hello
I'm trying to use Biopython to parse bacterial genomes from the NCBI:
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/
My initial impression is that all of the Bio.GenBank methods scale very
badly with the size of the input file.
For example, Nanoarchaeum equitans (file NC_005213.gbk, about 1.2 MB) can
be loaded in about a minute using either the FeatureParser or the
RecordParser.
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gbk
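For reference, this is roughly all my code does (a minimal sketch; I am
assuming the FeatureParser's parse() method is given an open handle and
returns a SeqRecord, and that the RecordParser is used the same way):

    from Bio import GenBank

    handle = open("NC_005213.gbk")
    parser = GenBank.FeatureParser()
    # parse the whole record in one go; the SeqRecord holds every feature
    record = parser.parse(handle)
    handle.close()
    print "%s has %i features" % (record.id, len(record.features))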
However, for larger files the parser seems to run out of system
resources, or at least requires more time than I have been prepared to
give it, e.g. E. coli K12, file NC_000913.gbk (about 10 MB):
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K12/
See also related posts in November 2004, e.g.
http://biopython.org/pipermail/biopython/2004-November/002470.html
To avoid the memory issues, I would like to make a single pass through
the file, iterating over the features (in particular, the CDS features)
one at a time as SeqFeature objects, rather than holding them all in
memory at once.
I have tried using the GenBank.Iterator, but as far as I can tell each
"step" of that iteration is an entire plasmid/chromosome (the code looks
for the LOCUS line to split records).
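For what it is worth, this is the sort of loop I tried (again a sketch; I
am assuming GenBank.Iterator takes the handle plus a parser, and that
next() returns None when the file is exhausted):

    from Bio import GenBank

    handle = open("NC_000913.gbk")
    parser = GenBank.FeatureParser()
    iterator = GenBank.Iterator(handle, parser)
    while 1:
        record = iterator.next()
        if record is None:
            break
        # each record is a complete chromosome or plasmid, fully parsed,
        # so this does not help with the memory problem described above
        print record.id, len(record.features)
    handle.close()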
It would seem that I would need:

(1) A new "FeatureIterator", ideally using the existing Martel and
    mxTextTools 'regular expressions on steroids' framework (which does
    seem rather overwhelming!), and

(2) A modified version of the FeatureParser to return (just) SeqFeature
    objects.
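To make the idea concrete, here is roughly the sort of thing I have in
mind (entirely hypothetical; the FeatureIterator name is made up, the
crude line-based scan ignores Martel, assumes a single LOCUS per file,
and yields raw feature table text rather than finished SeqFeature
objects):

    def FeatureIterator(handle):
        # Yield the raw text of each entry in the GenBank FEATURES table,
        # one feature at a time, without holding the whole table in memory.
        in_features = 0
        chunk = []
        for line in handle:
            if line.startswith("FEATURES"):
                in_features = 1
                continue
            if not in_features:
                continue
            if line.startswith("ORIGIN") or line.startswith("//"):
                break
            if line[:21].strip() and chunk:
                # a new feature key starts in column six; flush the old one
                yield "".join(chunk)
                chunk = []
            chunk.append(line)
        if chunk:
            yield "".join(chunk)

    handle = open("NC_000913.gbk")
    for feature_text in FeatureIterator(handle):
        # here a cut-down FeatureParser would turn feature_text into a
        # single SeqFeature object
        print feature_text.split("\n")[0]
    handle.close()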
Any thoughts?
Thanks
Peter
--
PhD Student
MOAC Doctoral Training Centre
University of Warwick, UK