[Biopython] help with parsing EMBL

Mon Apr 26 11:52:28 EDT 2010

Hi Nick,

On Mon, Apr 26, 2010 Nick Leake wrote:
> Hello,
>
> I'm having trouble parsing an embl file (attached) with multiple
> sequences.  I want to be able to access the DNA sequences for
> manipulation and removal from a chromosomal region.  I originally
> thought that I could follow the same fasta format example shown in the
> biopython tutorial.  However, that failed to work.  Next, I tried to
> convert the file to a fastq or a fasta to just follow the examples -
> again, failed.  So, I looked around and found some embl parsing code:
>
> from Bio import SeqIO
>
> p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl")
> p.next()
> record=p.next()
>
> print record
>
> This kinda works, but fails to read all entries.

Well, yes:

from Bio import SeqIO
#that imports the library

p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl")
#that sets up the EMBL parser (although EMBL files are text so it is a bit
#odd to open it in binary read mode)

p.next() #reads the first record and discards it

record=p.next() #reads the second record and stores as variable record

You only ever try and look at the second record. See below...

> ...  In addition, I don't know what code I need to 'grab' the DNA
> information for manipulations and remove these sequences from
> a given DNA segment.    Can I get a little guidance to
> what I need to do or where I can look to help solve my problem?

What you probably want to start with is a simple for loop,

from Bio import SeqIO
for record in SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41"),"embl"):
    print record.id, record.seq

However, this runs into a problem:

Traceback (most recent call last):
  ...
ValueError: Expected sequence length 2, found 2483.

Looking at your file (which was too big to send to the list), your EMBL
file is invalid. Specifically this is failing on the record which starts:

ID   FROGGER    standard; DNA; INV; 2 BP.

That ID line says the sequence is just 2 base pairs, but in fact the
seems to be 2483bp. The ID line should probably be edited like this:

ID   FROGGER    standard; DNA; INV; 2483 BP.

Fixing that shows up another similar problem,

ID   TV1    standard; DNA; INV; 1728 BP.

should probably be:

ID   TV1    standard; DNA; INV; 1730 BP.

Then there is this record:

ID   DDBARI1    standard; DNA; INV; 1676 BP.

Several parts of the record suggest it should be 1676bp (not just the ID
line, but also for example the SQ line), but there is actually 1677bp of
sequence present.

After making those three edits by hand, Biopython should parse it.
I suspect your EMBL file has been manually edited. Where did it
come from?

Peter