[Biopython] Problem parsing embl files

Jaime Tovar jmtc21 at bath.ac.uk
Thu May 30 18:55:28 EDT 2013


Hi Peter,

I checked a similar version of the EMBL format description. 
It is a bit ambiguous, I think. From the definition we have:

XX - spacer line                (many per entry)
SQ - sequence header            (1 per entry)
CO - contig/construct line      (0 or >=1 per entry)
bb - (blanks) sequence data     (>=1 per entry)
// - termination line           (ends each entry; 1 per entry)

At first I read SQ ... (1 per entry) and thought it meant there must be 
one of them. The same goes for the rest (many per entry, >=1). 
But for example from the same definition we have:

DT - date                       (2 per entry)

But my file does not have DT and the parser was not complaining about 
it, so it made me think maybe I was doing something wrong. To be honest 
I can't say for sure whether it means there must be an SQ line, or 
whether it is optional but can appear only once per entry.

The files are not mine. They are third-party files I got from another 
researcher, who in turn got them from someone else, so... They are 
annotations for algae contigs as far as I know. I'm not sure why they 
don't have the sequence part.

To be honest I don't know if it is worth making changes to the parser. I 
can't say these files are actually well formatted. Maybe someone with 
more experience with embl files can give a second opinion.
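
Since all I need is the gene ids and the start/end coordinates, another 
option is to read the FT lines directly and skip the parser. Below is a 
minimal sketch, assuming only simple locations like the ones in my file 
(302..490 or complement(a..b)) and single-line /note qualifiers. The 
function name read_features and the dict keys are made up for 
illustration; this is not part of Biopython.

```python
import re


def read_features(lines):
    """Collect simple feature records from EMBL 'FT' lines.

    Hypothetical helper: handles single-line locations and
    single-line /note qualifiers only.
    """
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue  # skip ID, XX, AC, FH, // and anything else
        body = line[5:].strip()
        if not body:
            continue
        if line[5:6] != " ":
            # A feature key starts in column 6: "FT   CDS    302..490"
            parts = body.split(None, 1)
            if len(parts) != 2:
                continue
            key, location = parts
            coords = [int(n) for n in re.findall(r"\d+", location)]
            if not coords:
                continue
            features.append({
                "key": key,
                "start": min(coords),
                "end": max(coords),
                "strand": -1 if location.startswith("complement") else 1,
                "note": None,
            })
        elif features:
            # Qualifier line: keep the /note value for the latest feature
            match = re.match(r'/note="(.*)"$', body)
            if match:
                features[-1]["note"] = match.group(1)
    return features
```

Calling read_features(open("sctg_0.embl")) should then give one dict per 
feature, with the EuGene gene id inside the "note" text.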

Do you think I can cheat the parser if I just 'sed' my embl files and 
replace the // with something like:

"""XX
SQ


//"""
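
The same substitution could be done in Python instead of sed. This is 
only a sketch: the function name add_empty_sq is made up, I have left 
out the two blank lines from the snippet above because I am not sure how 
the parser treats blank lines after SQ, and I have not checked whether 
the parser accepts a bare SQ line with no sequence after it.

```python
def add_empty_sq(text):
    """Insert 'XX' and a bare 'SQ' line before each '//' terminator.

    Hypothetical helper: entries that already contain an SQ line are
    left unchanged.
    """
    out = []
    seen_sq = False
    for line in text.splitlines():
        if line.startswith("SQ"):
            seen_sq = True
        if line.strip() == "//":
            if not seen_sq:
                out.append("XX")
                out.append("SQ")
            seen_sq = False  # reset for the next entry in the file
        out.append(line)
    return "\n".join(out) + "\n"
```

The patched text could then be wrapped in a StringIO handle and passed 
to SeqIO.parse(handle, "embl") to see if the error goes away.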

I didn't know github had gist :) I have some animadversion against 
github so I never use it :D

Thanks for the help!

Jaime.

On 30/05/2013 23:03, Peter Cock wrote:
> On Thu, May 30, 2013 at 8:48 PM, Jaime Tovar <jmtc21 at bath.ac.uk> wrote:
>> Hi all,
>>
>> This is the first time I have tried to parse embl files with biopython.
>> I'm trying to get the gene ids and coordinates for start/end of each gene.
>>
>> I thought it would be straightforward, like with other annotation files,
>> so I wrote a small script to test it.
>>
>> from Bio import SeqIO
>>
>> if __name__ == '__main__':
>>      # Open the EMBL file and print each parsed record
>>      with open("sctg_0.embl", "r") as handle:
>>          for record in SeqIO.parse(handle, "embl"):
>>              print(record)
>>
>> But when running the script I get an error which may suggest the embl files
>> have an issue
>>
>> ValueError: Premature end of features table, marker '//' found
>>
>> I checked the source code of the parser and it seems the embl file has
>> problems, but when I checked the embl file format the files seem ok.
> If they are like your example, they are a bit unusual.
>
>> I have a
>> few thousand files formatted in the same way, so I can't think of any other
>> way to deal with the problem but to parse them.
>>
>> The annotation files have only annotation info, no sequences. Here I
>> uploaded an example.
>>
>> http://depositfiles.com/files/481uob95e
>>
>> I'm using python 2.7.4 and biopython 1.61 on a win x64 computer.
>>
>> Any advice and suggestion will be greatly appreciated.
>>
>> Jaime.
> Hi Jaime,
>
> For sharing plain text files, http://gist.github.com is a nicer option.
>
> The problem is your file looks like this:
>
> ID   sctg_0 standard; DNA; DIV; 3745584 BP.
> XX
> AC   sctg_0;
> XX
> FH   Key             Location/Qualifiers
> FH
> FT   CDS                302..490
> FT                   /note="EuGene predicted gene nr: Esi0000_0001"
> ...
> FT   mRNA               complement(3744791..3745584)
> FT                   /note="EuGene predicted gene nr: Esi0000_0662"
> //
>
> The parser is expecting an SQ line after the FT lines before the //
> As you said, your files lack any sequence information - is that deliberate?
>
> This is not something I've seen before, but we can probably
> modify the EMBL parser to cope with this - much like how
> GenBank files can omit the actual sequence data.
>
> On the other hand, the SQ line is not defined as optional
> so perhaps we are doing the right thing and rejecting an
> invalid file? ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt
>
> Where did your EMBL format file come from?
>
> Thanks,
>
> Peter


