[BioPython] Entrez.efetch

Peter Cock p.j.a.cock at googlemail.com
Wed Oct 8 13:46:25 UTC 2008


Stephan wrote:
>> When I download this chromosome manually from the NCBI-website,
>> I indeed find a difference in one line, namely in line 3 of the
>> genbank file. In the manually downloaded file line 3 reads:
>> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced
>> from my code I have only: "ACCESSION NC_004353". So without that
>> region-information, the biopython parser of course runs to a premature
>> end.

Stephan - when you say manually, do you mean via a web browser?  If so,
the browser is likely using a subtly different URL, which might explain
why the NCBI generates slightly different data on the fly.  Either way,
this ACCESSION line difference shouldn't trigger the "Premature end of
file in sequence data" error in the GenBank parser.
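
To confirm that the ACCESSION line really is the only difference between
the two downloads, the files can be compared line by line. This is just a
rough sketch using the standard library; the function name and the
LOCUS/DEFINITION placeholder lines are made up for illustration, and only
the two ACCESSION lines come from Stephan's report.

```python
from itertools import zip_longest


def differing_lines(text_a, text_b):
    """Return (line_number, line_a, line_b) for every line that differs."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    paired = zip_longest(lines_a, lines_b, fillvalue="")
    return [
        (number, a, b)
        for number, (a, b) in enumerate(paired, start=1)
        if a != b
    ]


# Placeholder records: only the ACCESSION lines are taken from the report.
browser_copy = (
    "LOCUS placeholder\n"
    "DEFINITION placeholder\n"
    "ACCESSION NC_004353 REGION: 1..1351857\n"
)
script_copy = (
    "LOCUS placeholder\n"
    "DEFINITION placeholder\n"
    "ACCESSION NC_004353\n"
)

diffs = differing_lines(browser_copy, script_copy)
for number, a, b in diffs:
    print("line %d: %r vs %r" % (number, a, b))
```

With real files, each one would be read into a string first and passed to
the same function.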

On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> This is a tricky problem that I ran into as well and is fixed in the
> latest CVS version. The issue is that the Biopython reader is using an
> UndoHandle instead of a standard Python handle. By default some of these
> operations appear to be assuming an iterator, but UndoHandle did not
> provide this.

Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle.
Just adding the close made Stephan's example work for me.  What
exactly was the problem you ran into?  One of the other parsers,
perhaps?
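
Stephan's script is not quoted in this message, so the following is only
a sketch of the general pattern, assuming the script saved the downloaded
record to a local file before parsing it. The helper name save_handle and
the in-memory placeholder record are made up for illustration. The point
is that the output file is closed (and so flushed to disk) before
anything tries to read it back; otherwise the tail of the record may
still be sitting in a buffer.

```python
import io
import os
import tempfile


def save_handle(in_handle, path):
    """Copy all lines from in_handle to path, closing both handles."""
    try:
        # The with block closes out_handle, which flushes buffered data.
        with open(path, "w") as out_handle:
            for line in in_handle:
                out_handle.write(line)
    finally:
        in_handle.close()
    return path


# Placeholder record standing in for a downloaded GenBank file.
record = "LOCUS placeholder\nORIGIN\n" + "acgtacgtac\n" * 1000 + "//\n"
in_handle = io.StringIO(record)

path = os.path.join(tempfile.mkdtemp(), "record.gbk")
save_handle(in_handle, path)

# Only read the file back once it has been closed.
with open(path) as check_handle:
    saved = check_handle.read()

print(saved.splitlines()[-1])
```

In a real script, in_handle would be the handle returned by
Entrez.efetch rather than a StringIO object.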

> As a result, you can lose the first couple of lines which are
> previously examined to determine the filetype. The fix is to make
> this a proper iterator. You can either check out current CVS, or
> make the addition manually to Bio/File.py in your current version:
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython

Adding this to the UndoHandle seems a sensible improvement - but I
don't see how it can affect Stephan's script.
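
For anyone curious what kind of problem Brad describes, here is a
simplified stand-in for an undo-capable handle. It is NOT the actual code
in Bio/File.py; the class name PeekableHandle and the sample text are
made up for illustration. It shows one way a peeked line could go
missing (iterating over the underlying handle directly) and how giving
the wrapper its own iterator avoids that.

```python
import io


class PeekableHandle:
    """Simplified stand-in for an undo handle (not Biopython's code)."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # lines pushed back, returned before new reads

    def readline(self):
        if self._saved:
            return self._saved.pop(0)
        return self._handle.readline()

    def saveline(self, line):
        if line:
            self._saved.insert(0, line)

    def peekline(self):
        line = self.readline()
        self.saveline(line)
        return line

    # Iterator support: iteration goes through readline(), so any
    # pushed-back lines are seen first.
    def __iter__(self):
        return self

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line


sample = "LOCUS placeholder\nACCESSION NC_004353\n//\n"

# Peek, then iterate over the raw handle: the peeked line is skipped.
raw = io.StringIO(sample)
wrapped = PeekableHandle(raw)
first = wrapped.peekline()
bypassed = list(raw)

# Peek, then iterate over the wrapper: nothing is lost.
wrapped_again = PeekableHandle(io.StringIO(sample))
wrapped_again.peekline()
complete = list(wrapped_again)

print(len(bypassed), len(complete))
```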

Peter


