[BioPython] Re: [Biopython-dev] Notification: incoming/109

Andreas Kuntzagk andreas.kuntzagk@mdc-berlin.de
14 Jan 2003 16:38:28 +0100


> > >From andreas.kuntzagk@mdc-berlin.de Tue Dec 10 12:31:46 2002
> [...]
> > 
> > While parsing the recent GenBank release, I got the following error:
> > 
> > 
> > >>> from Bio import GenBank
> > >>> f=file("gbest1.seq")
> > 
> > >>> GenBank.Iterator(f,has_header=1)
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> >   File "/usr/lib/python2.2/site-packages/Bio/GenBank/__init__.py", line 171, in __init__
> >     self._reader = RecordReader.StartsWith(handle, "LOCUS")
> >   File "/usr/lib/python2.2/site-packages/Martel/RecordReader.py", line 133, in __init__
> >     self.tagtable)
> >   File "/usr/lib/python2.2/site-packages/Martel/RecordReader.py", line 92, in _find_begin_positions
> >     raise ReaderError("invalid format starting with %s" % repr(text[:50]))
> > Martel.RecordReader.ReaderError: invalid format starting with 'DEFINITION  zd84h07.s1 Soares_fetal_heart_NbHH19W '
> > 
> > The problem seems to be that in this file there is only one empty line
> > after the "...reported sequences" line instead of the expected two.
> > 
> 
> I would suggest the following patch. This reads all text from the handle
> into a string (which can consume quite some memory :-( ) and skips to the
> first LOCUS. All remaining text is then turned into a StringIO (would
> cStringIO be better?)
[patch deleted]

Answering myself again. Here is a better patch (against biopython-1.10).
It wraps the handle in a cStringIO only when the handle doesn't have a
seek; then I read up to the first "LOCUS" line and 'unread' that line by
seeking back. This also gives more flexibility in the structure of the
header.
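
For anyone who wants to try the idea outside of the Bio.GenBank code, here
is a rough standalone sketch of the "read to the first LOCUS, then unread"
approach. It is only an illustration, not the patch itself: the function
name skip_to_first_locus is made up, it is written for a current Python
(io.StringIO instead of cStringIO), and it assumes the handle was opened in
text mode. It remembers the position with tell() before each readline()
rather than seeking backwards by the line length, since relative seeks are
not allowed on text-mode files in newer Pythons.

```python
import io


def skip_to_first_locus(handle):
    """Return a handle positioned at the start of the first LOCUS line.

    Hypothetical helper for illustration, not part of Biopython.
    If no LOCUS line exists, the handle is left at end of file.
    """
    # A handle that cannot seek (e.g. a pipe) is read fully into memory,
    # which can cost a lot of RAM for a full GenBank release file.
    seekable = getattr(handle, "seekable", None)
    if seekable is None or not seekable():
        handle = io.StringIO(handle.read())

    while True:
        pos = handle.tell()        # remember where this line starts
        line = handle.readline()
        if line == "" or line.startswith("LOCUS"):
            handle.seek(pos)       # 'unread' the line we just consumed
            return handle
```

The returned handle can then be passed on to whatever expects the records
to start at "LOCUS", regardless of how many blank lines the header has.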

Is there anybody else out there working with full GenBank releases who
can confirm this patch?

---patch---

# diff GenBank/__init__.py ~/biopython-1.10/Bio/GenBank/
162,166d161
<             try:
<                 handle.__getattribute__("seek") #Need seek to place file-position back after reading "LOCUS"
<             except:
<                 import cStringIO #if there is no seek, we read all into a string and use a StringIO
<                 handle=cStringIO.StringIO(handle.read())
169,170c164
<                 if cur_line.startswith("LOCUS") or cur_line=="":
<                     handle.seek(-len(cur_line),1)
---
>                 if cur_line.find("reported sequences") >= 0:
171a166,169
> 
>             # read off two more lines and we are ready to go
>             handle.readline()
>             handle.readline()