[Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files

Tue Mar 8 17:04:21 EST 2005

http://bugzilla.open-bio.org/show_bug.cgi?id=1747

------- Additional Comments From biopython-bugzilla at maubp.freeserve.co.uk  2005-03-08 17:04 -------
Created an attachment (id=198)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=198&action=view)
Patch to the class _Scanner in Bio/GenBank/__init__.py

The following patch seems to pass all existing GenBank tests, and my own
testing.

The only change is to the class _Scanner in Bio/GenBank/__init__.py

This patch was created on Windows XP with the Cygwin diff command against file
revision 1.53 (as shipped with Biopython 1.30 and 1.40b):

diff my_version.py vcs_version.py > patch.txt

Instead of using Martel to parse entire GenBank records, it is now only used to
parse the "header section", from the LOCUS line to the FEATURES line.  The
features themselves, and the nucleotide sequences, are parsed with new Python
code.
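
To illustrate the general idea (this is not the code in the attached patch),
here is a minimal sketch of splitting one record into the header block, the
feature table and the sequence lines using plain Python. The function names,
the made-up RECORD text and the returned data layout are all assumptions for
illustration only; the real _Scanner still hands the header block to Martel
and feeds events to a consumer, which is not shown here.

```python
# Hypothetical sketch only: names and data layout are NOT taken from the patch.

RECORD = """\
LOCUS       TEST0001                  20 bp    DNA     linear   SYN 01-JAN-2005
DEFINITION  Made-up example record.
FEATURES             Location/Qualifiers
     source          1..20
                     /organism="synthetic construct"
     gene            5..15
                     /gene="demo"
ORIGIN
        1 gatcctccat atacaacggt
//
"""


def split_genbank_record(lines):
    """Split one record into (header, feature, sequence) line lists."""
    header, features, sequence = [], [], []
    section = header
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("//"):
            break  # end-of-record marker
        if line.startswith("FEATURES"):
            header.append(line)  # header runs from LOCUS up to and including FEATURES
            section = features
        elif line.startswith("ORIGIN"):
            section = sequence
        elif section is features and line[:1].strip():
            # An unindented line (e.g. BASE COUNT) ends the feature table;
            # collect anything up to ORIGIN in a throwaway list.
            section = []
        else:
            section.append(line)
    return header, features, sequence


def group_features(feature_lines):
    """Group feature-table lines into (key, [location, qualifier, ...]) tuples."""
    grouped = []
    for line in feature_lines:
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if indent < 21:
            # Feature key lines are indented less than the qualifier column.
            key, location = line.split(None, 1)
            grouped.append((key, [location.strip()]))
        elif grouped:
            # Continuation line: a qualifier or wrapped location.
            grouped[-1][1].append(line.strip())
    return grouped


def join_sequence(sequence_lines):
    """Drop the leading base-position number on each line and join the rest."""
    parts = []
    for line in sequence_lines:
        parts.extend(line.split()[1:])
    return "".join(parts)


header, features, sequence = split_genbank_record(RECORD.splitlines())
print(len(header), group_features(features), join_sequence(sequence))
```

In this sketch only the short header list would need to go through the
heavier parser; the feature table and sequence, which make up most of a large
record, are handled with simple string operations.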

That is, rather than rewriting Martel, I rewrote the GenBank scanner.

It should be possible to go further and not use Martel at all, probably another
afternoon's work for me.

Updated timings to follow, but in general both time taken and memory required
are more than halved - which is nice!
