[Biopython-dev] GenBank feature iterator

Peter biopython-dev at maubp.freeserve.co.uk
Sun Jan 30 06:13:58 EST 2005


I wrote:

> I'm trying to use BioPython to parse bacterial genomes from the NCBI:-
> 
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/
> 
> My initial impression is that all of the Bio.GenBank methods scale very 
> badly with the size of the input file.

More detailed testing would appear to confirm this.  On the bright
side, I haven't actually run into any parsing errors due to unknown formats.

> For example, Nanoarchaeum equitans, file NC_005213.gbk is about 1.2 MB, 
> and can be loaded in about one minute using either the FeatureParser or 
> the RecordParser.
> 
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gbk 
> 
> However, for larger files the parser seems to run out of system 
> resources, or maybe requires more time than I have been prepared to give 
> it.  e.g. E. coli K12, file NC_000913.gbk (about 10MB):-
> 
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K12/

While my laptop did not have sufficient RAM to deal with this file,
my desktop could handle it in about three minutes - see below.

> See also related posts in November 2004, e.g.
> 
> http://biopython.org/pipermail/biopython/2004-November/002470.html

I have been doing some testing on this, and more memory makes a
marked difference in the size of GenBank file that can be loaded
(comparing my laptop and home desktop).

The following times and memory usage figures are on Windows 2000,
Python 2.3 running the attached script from idle.

NC_003065.gbk	   480 kb,   4 seconds,  28 MB RAM
NC_003064.gbk	 1,217 kb,  11 seconds,  56 MB RAM
NC_000854.gbk	 3,391 kb,  45 seconds, 165 MB RAM
NC_003063.gbk	 4,725 kb,  55 seconds, 195 MB RAM
NC_003062.gbk	 6,574 kb,  88 seconds, 268 MB RAM
NC_005966.gbk	 8,858 kb, 139 seconds, 372 MB RAM
NC_000913.gbk	10,267 kb, 171 seconds, 409 MB RAM
NC_000962.gbk	11,010 kb, 200 seconds, 486 MB RAM
NC_003997.gbk	12,026 kb, 228 seconds, 496 MB RAM
NC_002678.gbk	15,120 kb, 306 seconds, 586 MB RAM

The computer was a 2.26 GHz Intel Pentium 4, with 735 MB RAM.

For larger files (e.g. NC_005027.gbk at 18,211 kb) the system would
run out of memory and begin paging to disk.  For this particular
example, the test eventually completed in half an hour.

I have not performed this test under Linux, but a couple of examples
suggest the behaviour is similar.
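For anyone repeating the timings under Linux, here is a hypothetical
helper (not part of my script below, and the names report_usage and
parse_callable are my own invention) using the standard-library
resource module.  Note that the units of ru_maxrss vary by platform
(kilobytes on Linux, bytes on Mac OS X), and resource does not exist
on Windows:

```python
import time
import resource  # POSIX only; not available on Windows


def report_usage(parse_callable):
    """Run parse_callable() and report wall time and peak memory.

    Returns whatever parse_callable() returns, so it can wrap a
    parsing call without changing the surrounding code.
    """
    start = time.time()
    result = parse_callable()
    elapsed = time.time() - start
    # ru_maxrss is the process's peak resident set size so far;
    # on Linux the unit is kilobytes.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("Time %0.2f seconds, peak memory %i kb" % (elapsed, peak))
    return result
```

One would wrap the parse step, e.g.
record = report_usage(lambda: parser.parse(handle)).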

In consuming these vast amounts of memory, is the GenBank parser
really running "as designed"?

It is conceivable that, as smaller genomes tended to be sequenced
first, the parser was not originally expected to have to deal with
such large genomes.

Or is there a bug here?

> To avoid the memory issues, I would like to make a single pass though 
> the file, iterating over the features (in particular, the CDS features) 
> one by one into SeqFeature objects (not holding them all in memory at 
> once).
> 
> I have tried using the GenBank.Iterator, but as far as I can tell this 
> reads in a file and each step is an entire plasmid/chromosome (the 
> code looks for the LOCUS line).
> 
> It would seem that I would need:
> 
> A new FeatureIterator, ideally using the existing Martel and 
> mxTextTools 'regular expressions on steroids' framework (which does seem 
> rather overwhelming!).
> 
> and:
> 
> A modified version of the FeatureParser to return (just) SeqFeature 
> objects.

I have tried (and so far failed) to understand how the Martel and
mxTextTools parser works, and thus to modify it in the way I had hoped.

Peter

--
Here is the script, inline rather than as an attachment (which the
mailing list didn't like):


#Following is based on example code from
#http://www.biopython.org/docs/tutorial/Tutorial.html
#3.4.2  Parsing GenBank records

#The example files are all from the NCBI's ftp site,
#ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/

import time
from Bio import GenBank

#The following times and memory usage figures are on Windows 2000,
#Python 2.3 running this script from idle.
#The computer was a 2.26 GHz Intel Pentium 4, with 735 MB RAM.
#The memory usage is at the end of the script, rather than the peak
#value (which is slightly higher), as recorded by the Windows task
#manager's process watch.
gb_file = "C:\\genomes\\Bacteria\\Agrobacterium_tumefaciens_C58_Cereon\\NC_003065.gbk"  #    480 kb,   4 seconds,  28 MB RAM
gb_file = "C:\\genomes\\Bacteria\\Agrobacterium_tumefaciens_C58_Cereon\\NC_003064.gbk"  #  1,217 kb,  11 seconds,  56 MB RAM
gb_file = "C:\\genomes\\Bacteria\\Aeropyrum_pernix\\NC_000854.gbk"                      #  3,391 kb,  45 seconds, 165 MB RAM
gb_file = "C:\\genomes\\Bacteria\\Agrobacterium_tumefaciens_C58_Cereon\\NC_003063.gbk"  #  4,725 kb,  55 seconds, 195 MB RAM
gb_file = "C:\\genomes\\Bacteria\\Agrobacterium_tumefaciens_C58_Cereon\\NC_003062.gbk"  #  6,574 kb,  88 seconds, 268 MB RAM
gb_file = "C:\\genomes\\Bacteria\\Acinetobacter_sp_ADP1\\NC_005966.gbk"                 #  8,858 kb, 139 seconds, 372 MB RAM
gb_file = "C:\\genomes\\Bacteria\\Escherichia_coli_K12\\NC_000913.gbk"                  # 10,267 kb, 171 seconds, 409 MB RAM
gb_file = "C:\\genomes\\Bacteria\\Mycobacterium_tuberculosis_H37Rv\\NC_000962.gbk"      # 11,010 kb, 200 seconds, 486 MB RAM
gb_file = "C:\\genomes\\Bacteria\\Bacillus_anthracis_Ames\\NC_003997.gbk"               # 12,026 kb, 228 seconds, 496 MB RAM
gb_file = "C:\\genomes\\Bacteria\\Mesorhizobium_loti\\NC_002678.gbk"                    # 15,120 kb, 306 seconds, 586 MB RAM

#During parsing of the following file, Windows ran out of RAM and was
#paging to the hard disk almost continuously.  After six minutes, peak
#memory usage had been about 700MB and I killed the process.  Running
#this script from the command prompt rather than idle took 30 minutes,
#however I do not have a memory usage figure for this:
#gb_file = "C:\\genomes\\Bacteria\\Pirellula_sp\\NC_005027.gbk"                         # 18,211 kb

#The following are even larger test examples, which I have not attempted:
#gb_file = "C:\\genomes\\Bacteria\\Bradyrhizobium_japonicum\\NC_004463.gbk"             # 19,500 kb
#gb_file = "C:\\genomes\\Bacteria\\Streptomyces_coelicolor\\NC_003888.gbk"              # 24,390 kb

gb_handle = open(gb_file, 'r')

feature_parser = GenBank.FeatureParser()

start_time = time.time()

gb_iterator = GenBank.Iterator(gb_handle, feature_parser)

count = 0
while 1:
    print "Starting...",
    cur_record = gb_iterator.next()
    print "Done"

    if cur_record is None:
        break

    count = count + 1

    # now do something with the record
    print count, cur_record.name, len(cur_record.features), len(cur_record.seq)

job_time = time.time() - start_time

print "Time elapsed %0.2f seconds" % job_time




