[Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files

Tue Mar 8 16:59:34 EST 2005

http://bugzilla.open-bio.org/show_bug.cgi?id=1747

------- Additional Comments From biopython-bugzilla at maubp.freeserve.co.uk  2005-03-08 16:59 -------
See also Andrew Dalke's comment on this bug:

http://www.biopython.org/pipermail/biopython-dev/2005-February/001910.html

I wrote:

> I filed bug 1747 as "major" and feel it renders the GenBank parser
> effectively useless for large genomes.

Andrew replied:

I saw that bug report when it came in a couple of weeks ago, but I was
busy at a client site.

One of the fundamental problems with this implementation of Martel
is that it parses each record entirely in memory and uses about 4x as
much memory as the record itself.  The slowness for large records
comes from hitting swap.  It can't be fixed without some non-trivial
changes to Martel, basically a rewrite.  If anyone wants to tackle
rewriting a regex engine, I have some comments about what needs to be
done.  As for me, I haven't touched the code in years because I
haven't needed that capability, and other tasks (including paying
work) keep me busy.
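
To illustrate the contrast with whole-record parsing, here is a minimal
sketch of reading one GenBank-style record line by line, so that memory
use stays roughly constant regardless of record size.  This is not
Martel or Biopython code: the function names iter_origin_chunks and
sequence_length are hypothetical, and the sketch assumes a simplified
record where the sequence follows an ORIGIN line and ends at "//".

```python
import io


def iter_origin_chunks(handle):
    """Yield sequence fragments from the ORIGIN block of one record.

    Hypothetical helper: reads the handle one line at a time and never
    holds the whole record in memory.
    """
    in_origin = False
    for line in handle:
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            # End-of-record marker: stop reading this record.
            break
        if in_origin:
            # Drop the leading position number, join the base groups.
            parts = line.split()
            yield "".join(parts[1:])


def sequence_length(handle):
    """Sum fragment lengths without building the full sequence string."""
    return sum(len(chunk) for chunk in iter_origin_chunks(handle))


# Tiny made-up record, used only to exercise the sketch.
sample = io.StringIO(
    "LOCUS       TEST\n"
    "ORIGIN\n"
    "        1 gatcctccat atacaacggt\n"
    "       21 atctccacct\n"
    "//\n"
)
print(sequence_length(sample))
```

Because each line is discarded once it has been processed, the cost of
a large record is dominated by I/O rather than by swap.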
