[Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing

Thu Jan 29 17:41:19 UTC 2009

http://bugzilla.open-bio.org/show_bug.cgi?id=2738

------- Comment #3 from bsouthey at gmail.com  2009-01-29 12:41 EST -------
First, I object to this patch because it replaces the current version without
keeping the old code. It should create a new parsing function so that we can
verify that the old and new versions provide exactly the same output for the
same input.

As indicated below, it does speed things up! So I have no problem with it
replacing the current parsing code in the next release, provided that the old
parsing code remains as a deprecated function. (Alternatively, add a
conditional statement with a flag to avoid the new code when required.)
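
To make the suggestion concrete, here is a minimal sketch of the flag approach.
All names here (parse_location, use_old_parser and the two helper functions)
are hypothetical and are not taken from the patch or from Bio.GenBank; the
helpers are trivial stand-ins that only handle a simple "start..end" location.

```python
import warnings


def _parse_location_old(text):
    """Hypothetical stand-in for the existing (slower) location parser."""
    start, end = text.split("..")
    return int(start), int(end)


def _parse_location_new(text):
    """Hypothetical stand-in for the patched (faster) location parser."""
    start, _, end = text.partition("..")
    return int(start), int(end)


def parse_location(text, use_old_parser=False):
    """Parse a simple 'start..end' location string.

    The new code is the default; the flag keeps the old code reachable
    so that both versions can be run on the same input and compared.
    """
    if use_old_parser:
        warnings.warn(
            "The old location parser is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
        return _parse_location_old(text)
    return _parse_location_new(text)
```

With something like this in place, the old code path stays available for one
release as a fallback and can then be removed.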

(In reply to comment #2)
> Created an attachment (id=1208)
>  --> (http://bugzilla.open-bio.org/attachment.cgi?id=1208&action=view)
> Simple test script for timing GenBank parsing
> 
> I've attached a trivial script to time parsing all the GenBank files in a
> directory to help anyone wanting to benchmark this change.
> 
> (In reply to comment #1)
> > However, from my limited testing using Python 2.5 on the Mac with GenBank
> > files for large bacterial genomes, this may be a price worth paying.  I'd
> > like independent measurements (and to check this on other platforms), but
> > this does seem to more than halve the time taken to parse GenBank files!
> 
> Further testing with Python 2.5 on Linux, this time also with some large
> eukaryotic files, appears to confirm a very large speed up (most obvious on
> feature rich GenBank files of course).
> 
> I still want to check this on other versions of python...
> 

I ran the script against the patched code on Linux with Python 2.3, 2.4, 2.5
and 2.6, and noted that the patch halved the time required to parse a GenBank
Incremental Update file (an update from Jan 2009: nc0101.flat, size 573 MB,
with 213942 records and a total length of 158245604 bp).
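
For anyone wanting to repeat the measurement, a timing harness along these
lines should be enough. This is only a sketch and not the script attached to
the bug: the function name time_parsing, the parse_records argument and the
"*.gbk" pattern are my own assumptions.

```python
import glob
import os
import time


def time_parsing(directory, parse_records, pattern="*.gbk"):
    """Parse every matching file in a directory and time the whole run.

    parse_records is any callable that takes an open file handle and
    returns an iterable of records. Returns (n_files, n_records, seconds).
    """
    n_files = 0
    n_records = 0
    start = time.perf_counter()
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path) as handle:
            # Iterate fully so that every record is actually parsed.
            for _record in parse_records(handle):
                n_records += 1
        n_files += 1
    return n_files, n_records, time.perf_counter() - start
```

With Biopython installed, parse_records could be
lambda handle: SeqIO.parse(handle, "genbank") (after "from Bio import SeqIO"),
run once against the unpatched code and once against the patched code.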

While the number of records and sequences is the same, I have not checked
whether the patched version produces exactly the same output as the unpatched
version. This is especially important for the different types of GenBank files
(Whole Genome Shotgun and CON types).
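
The check I have in mind could be sketched as below: walk the two record
streams in parallel and report the first record whose summary differs. The
names first_difference and summarize are hypothetical, and repr is only a
placeholder default for the summary function.

```python
from itertools import zip_longest

_MISSING = object()  # marks the shorter stream running out of records


def _summary(record, summarize):
    """Summarize a record, or flag that one stream ended early."""
    return "<no record>" if record is _MISSING else summarize(record)


def first_difference(old_records, new_records, summarize=repr):
    """Return (index, old_summary, new_summary) for the first mismatch.

    Returns None if both iterables yield records with equal summaries
    and have the same length.
    """
    pairs = zip_longest(old_records, new_records, fillvalue=_MISSING)
    for index, (old, new) in enumerate(pairs):
        old_summary = _summary(old, summarize)
        new_summary = _summary(new, summarize)
        if old_summary != new_summary:
            return index, old_summary, new_summary
    return None
```

For GenBank records, summarize would need to cover at least the record id, the
sequence, and the type, location and qualifiers of every feature, since the
location parsing is what the patch changes.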
