[Biopython-dev] GenBank feature iterator
Peter
biopython-dev at maubp.freeserve.co.uk
Thu Jan 20 13:35:47 EST 2005
Hello
I'm trying to use Biopython to parse bacterial genomes from the NCBI:
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/
My initial impression is that all of the Bio.GenBank methods scale very
badly with the size of the input file.
For example, Nanoarchaeum equitans (file NC_005213.gbk, about 1.2 MB) can
be loaded in about a minute using either the FeatureParser or the
RecordParser.
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gbk
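For reference, this is roughly all my code does (a minimal sketch; I am
assuming the FeatureParser's parse() method is given an open handle and
returns a SeqRecord, and that the RecordParser is used the same way):

    from Bio import GenBank

    handle = open("NC_005213.gbk")
    parser = GenBank.FeatureParser()
    # parse the whole record in one go; the SeqRecord holds every feature
    record = parser.parse(handle)
    handle.close()
    print "%s has %i features" % (record.id, len(record.features))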
However, for larger files the parser seems to run out of system
resources, or at least requires more time than I have been prepared to
give it, e.g. E. coli K12, file NC_000913.gbk (about 10 MB):
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K12/
See also related posts in November 2004, e.g.
http://biopython.org/pipermail/biopython/2004-November/002470.html
To avoid the memory issues, I would like to make a single pass through
the file, iterating over the features (in particular, the CDS features)
one at a time as SeqFeature objects, rather than holding them all in
memory at once.
I have tried using the GenBank.Iterator, but as far as I can tell each
"step" of that iteration is an entire plasmid/chromosome (the code looks
for the LOCUS line to split records).
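For what it is worth, this is the sort of loop I tried (again a sketch; I
am assuming GenBank.Iterator takes the handle plus a parser, and that
next() returns None when the file is exhausted):

    from Bio import GenBank

    handle = open("NC_000913.gbk")
    parser = GenBank.FeatureParser()
    iterator = GenBank.Iterator(handle, parser)
    while 1:
        record = iterator.next()
        if record is None:
            break
        # each record is a complete chromosome or plasmid, fully parsed,
        # so this does not help with the memory problem described above
        print record.id, len(record.features)
    handle.close()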
It would seem that I would need:

(1) A new "FeatureIterator", ideally using the existing Martel and
    mxTextTools 'regular expressions on steroids' framework (which does
    seem rather overwhelming!), and

(2) A modified version of the FeatureParser to return (just) SeqFeature
    objects.
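To make the idea concrete, here is roughly the sort of thing I have in
mind (entirely hypothetical; the FeatureIterator name is made up, the
crude line-based scan ignores Martel, assumes a single LOCUS per file,
and yields raw feature table text rather than finished SeqFeature
objects):

    def FeatureIterator(handle):
        # Yield the raw text of each entry in the GenBank FEATURES table,
        # one feature at a time, without holding the whole table in memory.
        in_features = 0
        chunk = []
        for line in handle:
            if line.startswith("FEATURES"):
                in_features = 1
                continue
            if not in_features:
                continue
            if line.startswith("ORIGIN") or line.startswith("//"):
                break
            if line[:21].strip() and chunk:
                # a new feature key starts in column six; flush the old one
                yield "".join(chunk)
                chunk = []
            chunk.append(line)
        if chunk:
            yield "".join(chunk)

    handle = open("NC_000913.gbk")
    for feature_text in FeatureIterator(handle):
        # here a cut-down FeatureParser would turn feature_text into a
        # single SeqFeature object
        print feature_text.split("\n")[0]
    handle.close()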
Any thoughts?
Thanks
Peter
--
PhD Student
MOAC Doctoral Training Centre
University of Warwick, UK