[Biopython-dev] [Bug 1747] New: GenBank parser is very slow and memory hungry for large input files

bugzilla-daemon at portal.open-bio.org
Mon Feb 7 08:15:14 EST 2005


           Summary: GenBank parser is very slow and memory hungry for large
                    input files
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

Tested on Biopython 1.30 with Python 2.3.

The GenBank parser (using Martel) appears to work on all the bacterial genomes I
have tested, but only if the machine has sufficient RAM.  The memory consumption
is incredibly high when attempting to load the larger GenBank files.

I would argue the severity is "Blocker" on the grounds that the GenBank parser
is effectively useless for large genomes.

The high memory usage is also suggestive of a severe memory leak.

As reported to the mailing list:


The following times and memory usage figures are on Windows 2000,
Python 2.3.3 using the GenBank Iterator (script running from IDLE):

NC_003065.gbk     480 kb,   4 seconds,  28 MB RAM
NC_003064.gbk   1,217 kb,  11 seconds,  56 MB RAM
NC_000854.gbk   3,391 kb,  45 seconds, 165 MB RAM
NC_003063.gbk   4,725 kb,  55 seconds, 195 MB RAM
NC_003062.gbk   6,574 kb,  88 seconds, 268 MB RAM
NC_005966.gbk   8,858 kb, 139 seconds, 372 MB RAM
NC_000913.gbk  10,267 kb, 171 seconds, 409 MB RAM
NC_000962.gbk  11,010 kb, 200 seconds, 486 MB RAM
NC_003997.gbk  12,026 kb, 228 seconds, 496 MB RAM
NC_002678.gbk  15,120 kb, 306 seconds, 586 MB RAM
NC_005027.gbk  18,211 kb, not enough RAM
NC_004463.gbk  19,500 kb, not enough RAM
NC_003888.gbk  24,390 kb, not enough RAM

The above computer was a 2.26 GHz Intel Pentium 4, with 735 MB RAM.

Memory usage figures are recorded by hand from TaskManager's process view at the
end of the program run.

For larger files (e.g. NC_005027.gbk at 18,211 kb) the system would run out of
memory and begin paging to disk.  For this particular example, the test
eventually completed in half an hour.
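
As a rough check on how the Windows figures above scale, the short sketch below
computes RAM per kb and seconds per kb of input for each run, then projects
the memory needed for NC_005027.gbk.  The numbers are copied from the table; the
projection simply assumes the observed ratios carry over to the larger file,
which is my assumption and not something I have measured.

```python
# (size_kb, seconds, ram_mb) copied from the Windows 2000 table above.
RUNS = [
    (480, 4, 28),
    (1217, 11, 56),
    (3391, 45, 165),
    (4725, 55, 195),
    (6574, 88, 268),
    (8858, 139, 372),
    (10267, 171, 409),
    (11010, 200, 486),
    (12026, 228, 496),
    (15120, 306, 586),
]

# Per-run ratios: memory and time spent per kb of input.
mb_per_kb = [ram / size for size, _, ram in RUNS]
sec_per_kb = [sec / size for size, sec, _ in RUNS]

for (size, _, _), m, s in zip(RUNS, mb_per_kb, sec_per_kb):
    print("%6d kb  %.4f MB/kb  %.4f s/kb" % (size, m, s))

# Crude projection for NC_005027.gbk (18,211 kb), bracketed by the
# lowest and highest observed MB/kb ratios (an assumption, not a measurement).
TARGET_KB = 18211
low_mb = min(mb_per_kb) * TARGET_KB
high_mb = max(mb_per_kb) * TARGET_KB
print("projected RAM: %.0f to %.0f MB" % (low_mb, high_mb))
```

On these figures the RAM cost stays at roughly 0.04 to 0.06 MB per kb of input,
while the time per kb more than doubles between the smallest and largest files.
Even the low end of the projection is close to the 735 MB of physical RAM, which
would be consistent with the paging seen on this file.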

I repeated this test on a multiuser dual Xeon Linux box with Python 2.3.4 for
the first few files, to confirm this is not just a Windows problem:

NC_003065.gbk    480 kb,  5 seconds,  ~24 MB RAM
NC_003064.gbk  1,217 kb, 13 seconds,  ~51 MB RAM
NC_000854.gbk  3,391 kb, 54 seconds, ~154 MB RAM
NC_003063.gbk  4,725 kb, 62 seconds, ~183 MB RAM
NC_003062.gbk  6,574 kb, 97 seconds, ~251 MB RAM

The Unix "time" command did not report memory usage, so the above figures were
recorded by hand from watching the process using "top".
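
The by-hand readings could in principle be replaced by a small timing harness.
The following is only a sketch, assuming a much newer Python 3 with the standard
"tracemalloc" module (not available in Python 2.3); "measure" and "fake_records"
are hypothetical names, and "fake_records" is a stand-in for the real parser
iterator.  Note that tracemalloc only counts memory allocated through Python's
own allocator, so its figures would likely be lower than those shown by
TaskManager or "top".

```python
import time
import tracemalloc


def measure(make_records):
    """Consume an iterator of records; return (count, seconds, peak_mb)."""
    tracemalloc.start()
    start = time.perf_counter()
    count = 0
    for _ in make_records():
        # Each record is discarded immediately, so any growth in peak
        # memory comes from the iterator itself, not from this loop.
        count += 1
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return count, elapsed, peak_bytes / (1024.0 * 1024.0)


def fake_records(n=1000):
    """Hypothetical stand-in for a GenBank record iterator."""
    for i in range(n):
        yield "record %d" % i


count, seconds, peak_mb = measure(fake_records)
print("%d records, %.3f seconds, %.2f MB peak" % (count, seconds, peak_mb))
```

To time a real file, "fake_records" would be swapped for a function that opens
the .gbk file and returns the parser's record iterator.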

