[Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Thu Jan 29 17:57:22 UTC 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2738





------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-01-29 12:57 EST -------
(In reply to comment #3)
> First, I object to this patch because it replaces the current version without
> keeping the old code.

It does keep the old code, and explicitly uses the old code for the non-simple
locations.

> It should create a new parsing function so verify that
> the old and new versions provide exactly the same output for the same input. 

We should probably extend the Biopython GenBank/EMBL parsing unit tests to make
sure this patch doesn't break anything, and additionally have some extra test
cases using big GenBank files which won't become official unit tests.  This
could be as simple as a script which parses all the records in a set of GenBank
files, printing out a very minimal summary of each feature location (including
subfeatures).  We then run the script with and without the patch, and confirm
their output matches.

Once we are happy that the patch doesn't change the parser behaviour, I don't
see any reason to offer both options to the end user.  In fact, I would prefer
to go further and REMOVE the old slow location parser after extending the
regular expression based parser to cope with ALL location variants.

> As indicated below, it does speed things up! So I have no problems for it to
> replace the current parsing code in the next release provided that the old
> parsing code remains as depreciated function. (Alternatively add a conditional
> statement with a flag to avoid this new code as required.) 

Having the new code controlled by some option would actually be pretty easy. 
Other than for testing I see no reason to do this.

> I ran the script on patched version of Linux Python (versions 2.3, 2.4, 2.5
> and 2.6) and noted that this halved the time required to parse a Genbank
> Incremental Update file (an update from Jan 2009: nc0101.flat size 573 mb)
> with 213942 records with total length 158245604 bp). 

That is consistent with the speed ups I have seen - you can get even more
depending on the proportion of features in the file.  Thanks for checking
python 2.3 to 2.6, nice to see they all benefit.

> While the number of records and sequences are the same, I have not checked if
> the patched version is providing exactly the same output as the unpatched
> version. This is very important for the different types of GenBank files
> (Whole Genome Shotgun and CON types).

I agree through testing is important here.  Would you like to suggest any
particular WGS or CON files for testing with?  I'm thinking something large
with a wide range of location types would be good for checking this patch (but
not to include with Biopython).

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.



More information about the Biopython-dev mailing list