[Biopython] Problem parsing embl files

Jaime Tovar jmtc21 at bath.ac.uk
Fri May 31 15:12:32 EDT 2013


Thanks Peter,

I found gff3 files I can easily parse for the data I need, so I will leave 
these strange embl files alone. If someone with more experience with embl 
files wants to take a look at them to check the parser, let me know and I 
will forward some sample files.

I asked the people who gave me the files if they know what software was 
used to generate them, but since it is third-party data they have no 
information. I was lucky they also had gff3 files, so I could still get 
the gene annotation data.
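
For anyone in the same situation, here is a rough sketch of pulling gene 
records out of a gff3 file. It assumes the standard nine tab-separated 
columns and "key=value" pairs separated by ";" in the last column. The 
function name parse_gff3_genes and the sample record are made up for 
illustration; this is not Biopython code.

```python
import io
from urllib.parse import unquote


def parse_gff3_genes(handle, feature_type="gene"):
    """Yield (seqid, start, end, strand, attributes) for matching features."""
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an optional embedded FASTA section follows; stop here
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        if ftype != feature_type:
            continue
        attributes = {}
        for pair in attrs.split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                # gff3 percent-encodes reserved characters in attributes
                attributes[unquote(key)] = unquote(value)
        yield seqid, int(start), int(end), strand, attributes


# Hypothetical two-feature sample, only to show the call.
sample = (
    "##gff-version 3\n"
    "contig1\tsrc\tgene\t10\t200\t.\t+\t.\tID=gene1;Name=abc\n"
    "contig1\tsrc\tCDS\t10\t200\t.\t+\t0\tID=cds1;Parent=gene1\n"
)
genes = list(parse_gff3_genes(io.StringIO(sample)))
for gene in genes:
    print(gene)
```

With a real file, pass an open file handle instead of the StringIO object.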

Jaime.

On 31/05/2013 09:43, Peter Cock wrote:
> On Thu, May 30, 2013 at 11:55 PM, Jaime Tovar <jmtc21 at bath.ac.uk> wrote:
>> Hi Peter,
>>
>> I checked a similar version with the description of the embl format. It
>> is a bit ambiguous, I think. From the definition we have:
>>
>> XX - spacer line                (many per entry)
>> SQ - sequence header            (1 per entry)
>> CO - contig/construct line      (0 or >=1 per entry)
>> bb - (blanks) sequence data     (>=1 per entry)
>> // - termination line           (ends each entry; 1 per entry)
>>
>> At first I read SQ ... (1 per entry) and thought it meant there must be one
>> of them, and similarly for the rest (many per entry, >=1). But for
>> example from the same definition we have:
>>
>> DT - date                       (2 per entry)
>>
>> But my file does not have DT and the parser was not complaining about it,
>> so it made me think maybe I was doing something wrong. To be honest I can't
>> say for sure whether it means there must be an SQ line, or that it is
>> optional but can appear only once per entry.
> Well yes, it does seem that missing DT lines is also technically invalid -
> but coping with that was quite simple. Missing a sequence in a sequence
> centric file format is rather more important ;)
>
>> The files are not mine. They are third-party files I got from another
>> researcher, who in turn got them from someone else, so... They are
>> annotations for algae contigs as far as I know. Not sure why they don't
>> have the sequence part.
> I would be interested to know how the files were prepared (e.g. which
> tool produced them), but this isn't vital.
>
>> To be honest I don't know if it is worth making changes to the parser. I
>> can't say these files are actually well formatted. Maybe someone with more
>> experience with embl files can give a second opinion.
> Good idea - anyone?
>
>> Do you think I can cheat the parser if I just 'sed' my embl files and
>> replace the // with something like:
>>
>> """XX
>> SQ
>>
>>
>> //"""
> Possibly - you'd need to do a little experimenting to find out the bare
> minimum that would allow the parser to continue without code changes.
>
> Peter
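
As a starting point for that experimenting, the 'sed' idea above could be 
sketched in Python as below. The function name add_stub_sequence is 
hypothetical, and the stub lines are simply the ones proposed in the 
quoted message; whether the Biopython parser accepts them as they stand 
is untested and would need checking. Entries that already have an SQ 
line are left untouched.

```python
import io


def add_stub_sequence(lines, stub=("XX\n", "SQ\n")):
    """Yield lines, adding `stub` before the // of any entry lacking SQ."""
    seen_sq = False
    for line in lines:
        if line.startswith("SQ"):
            seen_sq = True
        elif line.startswith("//"):
            if not seen_sq:
                # Entry ended without a sequence header: insert the stub.
                yield from stub
            seen_sq = False  # reset for the next entry in the file
        yield line


# Hypothetical entry with annotation only and no sequence section.
example = "ID   contig1\nFT   gene            1..100\n//\n"
patched = "".join(add_stub_sequence(io.StringIO(example)))
print(patched)
```

The stub is a parameter so that different candidates (for example with 
the blank lines from the original proposal, or a fuller SQ header) can 
be tried until the bare minimum the parser needs is found.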


