[Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Fri Jan 30 11:00:24 UTC 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2738





------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2009-01-30 06:00 EST -------
(In reply to comment #6)
> Created an attachment (id=1210)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1210&action=view) [details]
> Single test case that is not correctly parsed
> 
> I just used a simple 'print record' followed by a diff (but that does not
> check the references). This record (and related ones) has a difference
> between versions ...

If you do a 'print record' with a SeqRecord object, any references are shown
using their __repr__ string. That is currently the Python object default, which
includes a memory address (something I've been meaning to fix on Bug 2544).
Different objects will sit at different memory addresses, and those differences
show up in the diff.

For example, using the following as a simple test script and capturing its
output to files:

from Bio import SeqIO
record = SeqIO.read(open("CY029873.gbk"), "genbank")
print record

Running diff with and without the patch gave me:

9c9
< /references=[<Bio.SeqFeature.Reference instance at 0xb7b7bfcc>, <Bio.SeqFeature.Reference instance at 0xb7b8412c>]
---
> /references=[<Bio.SeqFeature.Reference instance at 0x866b04c>, <Bio.SeqFeature.Reference instance at 0x866b18c>]

i.e. the only change is in the memory addresses, so there are no real
differences between the records as far as I can see.  Please clarify; if you
have found a failing example I would be most interested.
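
To take the memory addresses out of the comparison, one option is to mask them
before diffing.  The following is only a rough sketch using the standard
library; the helper names (mask_addresses, diff_ignoring_addresses) are made up
for illustration and are not part of Biopython, and the two strings are just
the lines from the diff above:

```python
import difflib
import re

# Matches the address part of a default object repr, e.g. " at 0xb7b7bfcc>".
ADDRESS = re.compile(r" at 0x[0-9a-fA-F]+>")


def mask_addresses(text):
    """Replace memory addresses in default object reprs with a fixed token."""
    return ADDRESS.sub(" at 0x...>", text)


def diff_ignoring_addresses(old_text, new_text):
    """Return unified diff lines, ignoring differences in memory addresses."""
    old_lines = mask_addresses(old_text).splitlines()
    new_lines = mask_addresses(new_text).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, lineterm=""))


# The two lines reported by diff above (without and with the patch).
old = ("/references=[<Bio.SeqFeature.Reference instance at 0xb7b7bfcc>, "
       "<Bio.SeqFeature.Reference instance at 0xb7b8412c>]")
new = ("/references=[<Bio.SeqFeature.Reference instance at 0x866b04c>, "
       "<Bio.SeqFeature.Reference instance at 0x866b18c>]")

# An empty list means the outputs match once addresses are masked.
print(diff_ignoring_addresses(old, new))
```

Note this only hides the addresses; as mentioned in comment #6, it still does
not compare the contents of the references themselves.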

(In reply to comment #7)
> I downloaded a few example files including WGS and CON. I found that CON files
> are not parsed by either version. Not a surprise given that these have no
> sequences but that is a different topic. Apart from the errors in attached
> case, I have not seen any other errors (even parsing the references).

Could you clarify your problem with the CON files please (on a new bug or on
the mailing list, since as you point out this is a different topic)?  I've just
downloaded and unzipped one of the smaller CON files, and it parses fine for me:
ftp://ftp.ncbi.nih.gov/genbank/gbcon107.seq.gz

>>> from Bio import SeqIO
>>> count = 0
>>> for record in SeqIO.parse(open("gbcon107.seq"),"genbank") : count += 1
...
>>> print count
55031

As expected there is no sequence, but the name, description, features,
references, etc. are there.




