[Biopython-dev] [Bug 1747] New: GenBank parser is very slow and
memory hungry for large input files
bugzilla-daemon at portal.open-bio.org
Mon Feb 7 08:15:14 EST 2005
http://bugzilla.open-bio.org/show_bug.cgi?id=1747
Summary: GenBank parser is very slow and memory hungry for large
input files
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: All
Status: NEW
Severity: major
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
Tested on Biopython 1.30 with Python 2.3.
The GenBank parser (using Martel) appears to work on all the bacterial genomes I
have tested, but only if the machine has sufficient RAM. Memory consumption is
very high when loading the larger GenBank files: roughly 40 to 60 times the size
of the input file in the measurements below.
I would argue the severity is "Blocker" on the grounds that the GenBank parser
is effectively useless for large genomes.
The high memory usage also suggests a severe memory leak.
As reported to the mailing list:
http://www.biopython.org/pipermail/biopython-dev/2005-January/002881.html
http://www.biopython.org/pipermail/biopython-dev/2005-January/002889.html
The following times and memory usage figures were measured on Windows 2000 with
Python 2.3.3, using the GenBank Iterator (script run from IDLE):
File           Size (kb)  Time (s)  RAM (MB)
NC_003065.gbk        480         4        28
NC_003064.gbk      1,217        11        56
NC_000854.gbk      3,391        45       165
NC_003063.gbk      4,725        55       195
NC_003062.gbk      6,574        88       268
NC_005966.gbk      8,858       139       372
NC_000913.gbk     10,267       171       409
NC_000962.gbk     11,010       200       486
NC_003997.gbk     12,026       228       496
NC_002678.gbk     15,120       306       586
NC_005027.gbk     18,211  not enough RAM
NC_004463.gbk     19,500  not enough RAM
NC_003888.gbk     24,390  not enough RAM
The above computer was a 2.26 GHz Intel Pentium 4 with 735 MB RAM.
Memory usage figures were recorded by hand from Task Manager's process view at
the end of the program run.
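As a rough illustration of the scale, the Windows figures above can be turned
into a RAM-to-file-size ratio and a parsing rate. This is only a sketch: the
numbers are copied from the table, the helper name `summarise` is made up for
illustration, 1 MB is assumed to be 1,024 kb, and the RAM figures include
whatever the interpreter and IDLE use on their own.

```python
# Figures copied from the Windows 2000 measurements in this report:
# (file, size in kb, seconds, MB RAM). Files that ran out of RAM are omitted.
RESULTS = [
    ("NC_003065.gbk", 480, 4, 28),
    ("NC_003064.gbk", 1217, 11, 56),
    ("NC_000854.gbk", 3391, 45, 165),
    ("NC_003063.gbk", 4725, 55, 195),
    ("NC_003062.gbk", 6574, 88, 268),
    ("NC_005966.gbk", 8858, 139, 372),
    ("NC_000913.gbk", 10267, 171, 409),
    ("NC_000962.gbk", 11010, 200, 486),
    ("NC_003997.gbk", 12026, 228, 496),
    ("NC_002678.gbk", 15120, 306, 586),
]


def summarise(results):
    """Return (name, RAM / file size, kb parsed per second) for each row."""
    rows = []
    for name, size_kb, seconds, ram_mb in results:
        size_mb = size_kb / 1024.0  # assumption: 1 MB = 1,024 kb
        rows.append((name, ram_mb / size_mb, size_kb / float(seconds)))
    return rows


if __name__ == "__main__":
    for name, ratio, rate in summarise(RESULTS):
        print("%-14s RAM/file = %5.1fx  rate = %6.1f kb/s" % (name, ratio, rate))
```

On these figures the ratio stays at roughly 40 to 60 times the file size, while
the parsing rate falls from about 120 kb/s for the smallest file to about
50 kb/s for the largest one that completed.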
For larger files (e.g. NC_005027.gbk at 18,211 kb) the system would run out of
memory and begin paging to disk. For this particular example, the test
eventually completed in half an hour.
I repeated this test for the first few files on a multi-user dual Xeon Linux box
with Python 2.3.4, to confirm this is not just a Windows problem:
File           Size (kb)  Time (s)  RAM (MB)
NC_003065.gbk        480         5       ~24
NC_003064.gbk      1,217        13       ~51
NC_000854.gbk      3,391        54      ~154
NC_003063.gbk      4,725        62      ~183
NC_003062.gbk      6,574        97      ~251
The Unix "time" command did not report memory usage, so the above figures were
recorded by hand by watching the process in "top".
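For anyone wanting to repeat these measurements without reading figures off
Task Manager or "top", something along the following lines might work on a
current Python. It is a sketch under assumptions, not the script used for this
report: `measure` and `fake_records` are hypothetical names, `fake_records`
merely stands in for a real GenBank record iterator, and `tracemalloc` only
sees memory allocated through Python's own allocator, so it would probably
under-report anything allocated inside C extensions.

```python
import time
import tracemalloc


def measure(make_records):
    """Iterate over make_records() and return (count, seconds, peak_mb).

    make_records is any zero-argument callable returning an iterable of
    records, for example a function that opens a file and returns a parser
    iterator over it.
    """
    tracemalloc.start()
    start = time.perf_counter()
    count = 0
    for _record in make_records():
        count += 1
    seconds = time.perf_counter() - start
    _current, peak = tracemalloc.get_traced_memory()  # sizes in bytes
    tracemalloc.stop()
    return count, seconds, peak / (1024.0 * 1024.0)


def fake_records(n=1000):
    # Hypothetical stand-in for a GenBank record iterator.
    for i in range(n):
        yield "record %d" % i


if __name__ == "__main__":
    count, seconds, peak_mb = measure(fake_records)
    print("%d records, %.3f seconds, peak %.2f MB" % (count, seconds, peak_mb))
```

Taking the record source as a callable keeps the timing and memory bookkeeping
separate from whichever parser is under test, so the same harness could be
pointed at each input file in turn.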