[Biopython] problem parsing embl file

Peter biopython at maubp.freeserve.co.uk
Mon Jun 28 19:56:42 UTC 2010


Hi Sameet,

On Mon, Jun 28, 2010 at 8:20 PM, Sameet Mehta <msameet at gmail.com> wrote:
> Hi,
>
> I am trying to parse an EMBL file created in 2004.  The file contains a
> single record for the entire chromosome.  I have tried the following
> two approaches:
>
> r = SeqIO.parse( file( "chromosome1.contig.embl" ), "embl" ).next()
> r = SeqIO.read( file( "chromosome1.contig.embl" ), "embl" )

Those look fine. If you are using Biopython 1.54 you can just
pass the filename rather than opening the file explicitly.

> I get the following error:
> ValueError                                Traceback (most recent call last)
> ...
> ValueError: Expected sequence length 666, found 5580032.
>
> Can you tell me if I am doing anything wrong?  I am following the
> instructions as given in the Bio.SeqIO wiki page.

No, your code is fine. It looks like you have a broken EMBL file.
Could you show me the first few lines of the EMBL file, and also
have a look at it in a text editor to see if the sequence length
really is 666bp, or 5580032 as Biopython thinks?

(Or send the whole EMBL file to me off list?)
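
If a text editor struggles with a file that size, the same comparison
can be scripted. The following is only a rough sketch using the Python
standard library, not Biopython's own parser: the function name
embl_lengths is made up for illustration, and it assumes the ID line
ends in "<number> BP." and that the sequence lines sit between the SQ
line and the closing "//", as in a standard EMBL flat file.

```python
import re


def embl_lengths(lines):
    """Return (declared, counted) lengths for the first EMBL record.

    Hypothetical helper for illustration only. `lines` is any iterable
    of text lines, for example an open file handle.
    """
    declared = None      # length stated on the ID line, if any
    counted = 0          # letters actually present after the SQ line
    in_sequence = False
    for line in lines:
        if line.startswith("ID") and declared is None:
            # Assumption: the ID line ends with something like "666 BP."
            match = re.search(r"(\d+)\s+BP\.", line)
            if match:
                declared = int(match.group(1))
        elif line.startswith("SQ"):
            # The SQ header itself carries no sequence letters we count.
            in_sequence = True
        elif line.startswith("//"):
            # End of the first record.
            break
        elif in_sequence:
            # Sequence lines hold letters in blocks plus a trailing
            # position number; only the letters are counted.
            counted += sum(1 for ch in line if ch.isalpha())
    return declared, counted
```

Calling embl_lengths(open("chromosome1.contig.embl")) would return the
two numbers as a tuple, so you can see which one the file really
contains.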

In any case, that check seemed a bit strict (I've seen several
examples of unofficial GenBank or EMBL files where the
sequence length didn't match the header) so I relaxed this
check to a warning for Biopython 1.54. You could try updating
your copy of Biopython and see if it will accept the file then?

Regards,

Peter



