[Biopython] Problem parsing embl files
Jaime Tovar
jmtc21 at bath.ac.uk
Thu May 30 18:55:28 EDT 2013
Hi Peter,
I checked a similar version against the description of the EMBL format.
The description is a bit ambiguous, I think. From the definition we have:
XX - spacer line (many per entry)
SQ - sequence header (1 per entry)
CO - contig/construct line (0 or >=1 per entry)
bb - (blanks) sequence data (>=1 per entry)
// - termination line (ends each entry; 1 per entry)
At first I read SQ ... (1 per entry) and thought it meant there must be
one of them, and similarly for the rest (many per entry, >=1).
But for example from the same definition we have:
DT - date (2 per entry)
But my file does not have DT and the parser was not complaining about
it, so it made me think maybe I was doing something wrong. To be honest
I can't say for sure whether it means there must be an SQ line, or that
it is optional but can appear only once per entry.
The files are not mine. They are third-party files I got from another
researcher, who in turn got them from someone else, so... They are
annotations for algae contigs as far as I know. Not sure why they don't
have the sequence part.
To be honest I don't know if it is worth making changes to the parser. I
can't say these files are actually well formatted. Maybe someone with
more experience with embl files can give a second opinion.
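
As a stopgap that avoids the parser entirely, the feature keys, locations
and notes can be read straight from the FT lines. This is only a minimal
sketch: iter_features, LOCATION and NOTE are made-up names, and it assumes
each location and each /note qualifier fits on a single line, as in the
example file. Wrapped locations or qualifiers would need extra handling.

```python
import re

# Loose pattern for a one-line feature location, e.g. 302..490 or
# complement(3744791..3745584). This is an assumption, not the full grammar.
LOCATION = re.compile(r'^[A-Za-z(<]*\d[\w.,:<>^()]*$')
NOTE = re.compile(r'^/note="(.*)"$')


def iter_features(lines):
    """Yield (key, location, note) for each feature found in FT lines."""
    current = None
    for line in lines:
        if not line.startswith("FT"):
            continue  # skip ID, AC, XX, FH, // and anything else
        body = line[2:].strip()
        fields = body.split(None, 1)
        is_feature = (len(fields) == 2
                      and not body.startswith("/")
                      and LOCATION.match(fields[1]))
        if is_feature:
            # A new feature starts: emit the previous one first.
            if current is not None:
                yield tuple(current)
            current = [fields[0], fields[1], None]
        elif current is not None:
            match = NOTE.match(body)
            if match:
                current[2] = match.group(1)
    if current is not None:
        yield tuple(current)
```

Passing an open file handle, for example iter_features(open("sctg_0.embl")),
should give one tuple per CDS or mRNA feature, with the gene id still
embedded in the note text.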
Do you think I can cheat the parser if I just 'sed' my EMBL files and
replace the // with something like:
"""XX
SQ
//"""
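
The same substitution could be done in Python rather than sed. The sketch
below only rewrites the text: add_empty_sq is a made-up helper name, and
whether the parser accepts a bare SQ line is untested here. A real SQ line
normally also carries the sequence length and base counts, so the
placeholder may need adjusting.

```python
def add_empty_sq(text):
    """Insert 'XX' and a bare 'SQ' line before each '//' terminator
    in records that do not already contain an SQ line."""
    out = []
    has_sq = False
    for line in text.splitlines():
        if line.startswith("SQ"):
            has_sq = True
        if line.rstrip() == "//":
            if not has_sq:
                out.extend(["XX", "SQ"])
            has_sq = False  # reset for the next record in the file
        out.append(line)
    return "\n".join(out) + "\n"
```

Records that already have an SQ line are left unchanged, so the function
can be run over a mixed set of files.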
I didn't know github had gist :) I have some animadversion against
github so I never use them :D
Thanks for the help!
Jaime.
On 30/05/2013 23:03, Peter Cock wrote:
> On Thu, May 30, 2013 at 8:48 PM, Jaime Tovar <jmtc21 at bath.ac.uk> wrote:
>> Hi all,
>>
>> This is the first time I have tried to parse EMBL files with Biopython. I'm
>> trying to get the gene ids and coordinates for start/end of each gene.
>>
>> I thought it would be straightforward like with other annotation files, so I
>> wrote a small script to test it.
>>
>> from Bio import SeqIO
>> if __name__ == '__main__':
>>     handle = open("sctg_0.embl", "r")
>>     records = SeqIO.parse(handle, "embl")
>>     for record in records:
>>         print(record)
>>
>> But when running the script I get an error which may suggest the embl files
>> have an issue
>>
>> ValueError: Premature end of features table, marker '//' found
>>
>> I checked the source code of the parser and it seems the EMBL file has
>> problems, but when I checked the EMBL file format the files seem to be OK.
> If they are like your example, they are a bit unusual.
>
>> I have a
>> few thousand files formatted in the same way, so I can't think of any other
>> way to deal with the problem than to parse them.
>>
>> The annotation files have only annotation info, no sequences. Here I
>> uploaded an example.
>>
>> http://depositfiles.com/files/481uob95e
>>
>> I'm using python 2.7.4 and biopython 1.61 on a win x64 computer.
>>
>> Any advice and suggestion will be greatly appreciated.
>>
>> Jaime.
> Hi Jamie,
>
> For sharing plain text files, http://gist.github.com is a nicer option.
>
> The problem is your file looks like this:
>
> ID sctg_0 standard; DNA; DIV; 3745584 BP.
> XX
> AC sctg_0;
> XX
> FH Key Location/Qualifiers
> FH
> FT CDS 302..490
> FT /note="EuGene predicted gene nr: Esi0000_0001"
> ...
> FT mRNA complement(3744791..3745584)
> FT /note="EuGene predicted gene nr: Esi0000_0662"
> //
>
> The parser is expecting an SQ line after the FT lines and before the //.
> As you said, your files lack any sequence information - is that deliberate?
>
> This is not something I've seen before, but we can probably
> modify the EMBL parser to cope with this - much like how
> GenBank files can omit the actual sequence data.
>
> On the other hand, the SQ line is not defined as optional
> so perhaps we are doing the right thing and rejecting an
> invalid file? ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt
>
> Where did your EMBL format file come from?
>
> Thanks,
>
> Peter