[Biopython] Problem parsing embl files

Thu May 30 22:03:21 UTC 2013

On Thu, May 30, 2013 at 8:48 PM, Jaime Tovar <jmtc21 at bath.ac.uk> wrote:
> Hi all,
>
> This is the first time I have tried to parse EMBL files with Biopython. I'm trying to
> get the gene IDs and the start/end coordinates of each gene.
>
> I thought it would be straightforward, as with other annotation files, so I
> wrote a small script to test it.
>
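> from Bio import SeqIO
>
> if __name__ == '__main__':
>     # Open the annotation file; the context manager closes it afterwards.
>     with open("sctg_0.embl", "r") as handle:
>         # SeqIO.parse yields one SeqRecord per EMBL entry.
>         for record in SeqIO.parse(handle, "embl"):
>             print(record)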
>
> But when I run the script I get an error, which may suggest the EMBL files
> have a problem:
>
> ValueError: Premature end of features table, marker '//' found
>
> I checked the source code of the parser and it seems the EMBL file has
> problems, but when I checked against the EMBL file format the files seem OK.

If they are like your example, they are a bit unusual.

> I have a
> few thousand files formatted in the same way, so I can't think of any other way
> to deal with the problem than to parse them.
>
> The annotation files have only annotation info, no sequences. Here I
> uploaded an example.
>
> http://depositfiles.com/files/481uob95e
>
> I'm using Python 2.7.4 and Biopython 1.61 on a 64-bit Windows computer.
>
> Any advice or suggestions would be greatly appreciated.
>
> Jaime.

Hi Jaime,

For sharing plain text files, http://gist.github.com is a nicer option.

The problem is your file looks like this:

ID   sctg_0 standard; DNA; DIV; 3745584 BP.
XX
AC   sctg_0;
XX
FH   Key             Location/Qualifiers
FH
FT   CDS                302..490
FT                   /note="EuGene predicted gene nr: Esi0000_0001"
...
FT   mRNA               complement(3744791..3745584)
FT                   /note="EuGene predicted gene nr: Esi0000_0662"
//

The parser is expecting an SQ line after the FT lines and before the //.
As you said, your files lack any sequence information - is that deliberate?
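
To check which of your files are affected, something like the following
could be used. This is only a rough sketch: the function name
records_missing_sq is made up for illustration, and it assumes each record
starts with an ID line and ends with a // line. It does not use Biopython.

```python
def records_missing_sq(lines):
    """Return the IDs of records that reach '//' without an SQ line.

    Hypothetical helper for illustration; not part of Biopython.
    """
    missing = []
    current_id = None
    seen_sq = False
    for line in lines:
        if line.startswith("ID "):
            # New record: the identifier is the first token after "ID".
            current_id = line[2:].split()[0].rstrip(";")
            seen_sq = False
        elif line.startswith("SQ"):
            seen_sq = True
        elif line.startswith("//"):
            # End of record: note it if no SQ line was seen.
            if current_id is not None and not seen_sq:
                missing.append(current_id)
            current_id = None
    return missing


# A cut-down record with the same shape as the file described above.
example = """\
ID   sctg_0 standard; DNA; DIV; 3745584 BP.
XX
AC   sctg_0;
XX
FH   Key             Location/Qualifiers
FH
FT   CDS                302..490
FT                   /note="EuGene predicted gene nr: Esi0000_0001"
//
"""

print(records_missing_sq(example.splitlines()))
```

For a real file you would pass the open file handle in place of
example.splitlines().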

This is not something I've seen before, but we can probably
modify the EMBL parser to cope with this - much like how
GenBank files can omit the actual sequence data.
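
In the meantime, if all you need are the feature keys, coordinates and
/note values, one option is to read the FT lines directly without SeqIO.
Below is a minimal sketch under some assumptions: locations are only of the
form start..end or complement(start..end), each /note fits on one line, and
the name iter_features is hypothetical. Anything more complex, such as
join(...), is skipped rather than handled.

```python
import re

# "FT   <key>   <location>": the feature key starts right after "FT   ".
FEATURE_LINE = re.compile(r"^FT   (\S+)\s+(\S+)\s*$")
# A simple range, optionally wrapped in complement(...).
SIMPLE_LOCATION = re.compile(r"^(complement\()?(\d+)\.\.(\d+)\)?$")
# A single-line /note qualifier.
NOTE_LINE = re.compile(r'^FT\s+/note="(.*)"\s*$')


def iter_features(lines):
    """Yield one dict per simple feature found in EMBL-style FT lines.

    Hypothetical helper for illustration; not part of Biopython.
    Coordinates are kept exactly as written in the file (1-based).
    """
    current = None
    for line in lines:
        feature = FEATURE_LINE.match(line)
        if feature:
            if current is not None:
                yield current
            current = None
            location = SIMPLE_LOCATION.match(feature.group(2))
            if location:  # compound locations are skipped
                current = {
                    "key": feature.group(1),
                    "start": int(location.group(2)),
                    "end": int(location.group(3)),
                    "strand": -1 if location.group(1) else 1,
                    "note": None,
                }
            continue
        note = NOTE_LINE.match(line)
        if note and current is not None:
            current["note"] = note.group(1)
    if current is not None:
        yield current


# The two features shown in the excerpt above.
example = """\
FT   CDS                302..490
FT                   /note="EuGene predicted gene nr: Esi0000_0001"
FT   mRNA               complement(3744791..3745584)
FT                   /note="EuGene predicted gene nr: Esi0000_0662"
//
"""

for feat in iter_features(example.splitlines()):
    print(feat["key"], feat["start"], feat["end"], feat["strand"], feat["note"])
```

The gene identifier would still need to be pulled out of the note text,
since that is where these files appear to store it.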

On the other hand, the SQ line is not defined as optional,
so perhaps we are doing the right thing in rejecting an
invalid file? See ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt

Where did your EMBL format file come from?

Thanks,

Peter