[Biopython] Entrez.efetch

Fri Feb 26 10:59:42 UTC 2010

On Fri, Feb 26, 2010 at 2:33 AM, Rohan Maddamsetti
<rohan.maddamsetti at gmail.com> wrote:
> Hello,
>
> I'm new to biopython (installed yesterday), so please bear with me. This
> problem is similar to one sent to list on Wed, Oct 8, 2008 with the same
> subject line as this email, by a Stephan. Interestingly, though, my code
> works in a couple cases (including the chromosome input used by Stephan),
> but not in a third. I wrote the following simple function.
>
> def parseGenome(genbank_id):
>    handle = Entrez.efetch(db="genome",rettype="gb",id=genbank_id)
>    for seq_record in SeqIO.parse(handle,"gb"):
>        print "%s with %i features" % (seq_record.id, len(seq_record.features))
>    handle.close()
>
> ##Try on E. coli genome:
> parseGenome("CP000819.1")
> ##Try on Drosophila chromosome 4
> parseGenome("NC_004353.3")
> ##Try on Drosophila X chromosome
> parseGenome("NC_004354")
>
> And this is the output I get:
>
> CP000819.1 with 8759 features
> NC_004353.3 with 1191 features
> Traceback (most recent call last):
> ...
> ValueError: Premature end of file in sequence data
>
> Is this a bug, or am I doing something wrong? My eventual goal is to iterate
> through the features in the seq_record, and collect GC content statistics
> for the coding regions and introns.

I was able to run your example - but it is quite slow:

CP000819.1 with 8759 features
NC_004353.3 with 1191 features
NC_004354.3 with 10397 features

In this case the Drosophila X chromosome is a 32MB GenBank file,
and I guess you had a network problem resulting in a partial download.
This would explain the error from the parser, "Premature end of file in
sequence data".

I would say you did something wrong - downloading and parsing large
files on the fly isn't a great idea. You should download them once, save
them to disk, and then parse the local file. Also, for genomes I would use
the NCBI's FTP site rather than Entrez (i.e. HTTP). The NCBI have
guidance/scripts on setting up a local mirror and keeping it up to date.
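
As a rough sketch of the download-once approach (written for Python 3,
unlike the Python 2 print statement in the quoted code): the names
download_once and parse_genome, the filename argument and the email
placeholder are made up for illustration. The efetch arguments are copied
from the quoted function, plus retmode="text"; whether those database and
rettype values still suit the NCBI's current services is an assumption.

```python
import os
import shutil


def download_once(filename, open_handle):
    """Save a remote record to `filename` unless it is already on disk.

    `open_handle` is any zero-argument callable returning a text-mode,
    file-like handle (a hypothetical helper, not part of Biopython).
    """
    if os.path.exists(filename):
        return filename  # already saved, so no network access is needed
    partial = filename + ".part"
    handle = open_handle()
    try:
        with open(partial, "w") as out:
            shutil.copyfileobj(handle, out)
    finally:
        handle.close()
    # Rename only after the copy finished without raising an error.
    os.replace(partial, filename)
    return filename


def parse_genome(genbank_id, filename):
    # Imported here so download_once can be used without Biopython.
    from Bio import Entrez, SeqIO

    Entrez.email = "you@example.com"  # placeholder, use your own address
    download_once(
        filename,
        # db and rettype as in the quoted message (assumed still valid).
        lambda: Entrez.efetch(db="genome", rettype="gb", retmode="text",
                              id=genbank_id),
    )
    # From here on only the local file is read.
    for seq_record in SeqIO.parse(filename, "gb"):
        print("%s with %i features" % (seq_record.id,
                                       len(seq_record.features)))
```

Calling parse_genome("NC_004354", "NC_004354.gbk") a second time would
then read the saved file rather than fetching 32MB again. Note the rename
only guards against a copy that fails with an error; a download that is
silently cut short would still be saved, and the parser error would be the
sign to delete the file and fetch it again.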

In your case, since you will be fine-tuning your script to compute the GC
statistics for the coding regions and so on, it will take a while to get
just right, so you really should be parsing a local file.
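
For what it's worth, a first pass at the coding-region part might look
something like the sketch below (Python 3). It assumes the GenBank file is
already on disk; gc_fraction and cds_gc are hypothetical helper names, and
the sketch only covers CDS features, not introns.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C letters in a sequence string."""
    sequence = str(sequence).upper()
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty sequence
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def cds_gc(filename):
    """Yield (record id, CDS location, GC fraction) for each CDS feature."""
    # Imported here so gc_fraction can be used without Biopython.
    from Bio import SeqIO

    for seq_record in SeqIO.parse(filename, "gb"):
        for feature in seq_record.features:
            if feature.type == "CDS":
                # extract() pulls out the feature's sequence from the parent.
                cds_seq = feature.extract(seq_record.seq)
                yield (seq_record.id, str(feature.location),
                       gc_fraction(cds_seq))
```

Looping over cds_gc("NC_004354.gbk") (a made-up local filename) would give
one tuple per CDS, which you can rerun as often as you like while tuning
the script without touching the network.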

I hope that helps,

Peter