[Biopython] Problem parsing embl files

Fri May 31 04:43:28 EDT 2013

On Thu, May 30, 2013 at 11:55 PM, Jaime Tovar <jmtc21 at bath.ac.uk> wrote:
> Hi Peter,
>
> I checked a similar version against the description of the EMBL format. It
> is a bit ambiguous, I think. From the definition we have:
>
> XX - spacer line                (many per entry)
> SQ - sequence header            (1 per entry)
> CO - contig/construct line      (0 or >=1 per entry)
> bb - (blanks) sequence data     (>=1 per entry)
> // - termination line           (ends each entry; 1 per entry)
>
> At first I read SQ ... (1 per entry) and thought it meant there must be one
> of them, and similarly for the rest (many per entry, >=1). But for
> example from the same definition we have:
>
> DT - date                       (2 per entry)
>
> But my file does not have DT and the parser was not complaining about it,
> so it made me think maybe I was doing something wrong. To be honest I can't
> say I'm sure whether it means there should be an SQ or whether it is
> optional but can only appear once per entry.

Well yes, it does seem that missing DT lines is also technically invalid -
but coping with that was quite simple. Missing a sequence in a
sequence-centric file format is rather more important ;)
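
To see which line types an entry actually lacks, a rough check along these
lines might help. It is only a sketch in plain Python (it does not use
Biopython), the function name count_line_codes is made up for illustration,
and the example entry is invented rather than taken from the files in
question:

```python
from collections import Counter


def count_line_codes(embl_text):
    """Return one Counter of two-letter line codes per entry.

    Entries are assumed to end with a '//' line; sequence data lines,
    which start with blanks, are counted under the label 'bb'.
    """
    entries = []
    counts = Counter()
    for line in embl_text.splitlines():
        if not line.strip():
            continue  # ignore completely blank lines
        code = line[:2]
        if code == "//":
            entries.append(counts)
            counts = Counter()
        else:
            counts["bb" if code == "  " else code] += 1
    if counts:
        entries.append(counts)  # trailing entry with no terminator
    return entries


# Invented entry with annotation only, no DT, SQ or sequence lines.
example = """ID   hypothetical_contig_1
XX
DE   made-up entry without sequence
XX
//
"""

for counts in count_line_codes(example):
    missing = [code for code in ("DT", "SQ", "bb") if counts[code] == 0]
    print("missing line types:", missing)
```

Running this over the real files would show whether SQ is absent from every
entry or only from some of them.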

> The files are not mine. They are third-party files I got from another
> researcher, who in turn got them from someone else, so... They are
> annotations for algae contigs as far as I know. Not sure why they don't
> have the sequence part.

I would be interested to know how the files were prepared (e.g. which
tool produced them), but this isn't vital.

> To be honest I don't know if it is worth making changes to the parser. I
> can't say these files are actually well formatted. Maybe someone with more
> experience with embl files can give a second opinion.

Good idea - anyone?

> Do you think I can cheat the parser if I just 'sed' my EMBL files and
> replace the // with something like:
>
> """XX
> SQ
>
>
> //"""

Possibly - you'd need to do a little experimenting to find out the bare
minimum that would allow the parser to continue without code changes.
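
As a starting point for that experimenting, something like the following
could stand in for the sed approach. It is a sketch only: the function name
add_empty_sequence is hypothetical, the stub it inserts (an XX spacer plus a
bare SQ line) is a guess at the bare minimum, and I have not checked whether
the Biopython parser accepts the result.

```python
def add_empty_sequence(embl_text):
    """Insert a stub 'SQ' header before '//' in entries that lack one.

    Assumption: a bare 'SQ' line preceded by an 'XX' spacer might be
    enough for a parser to continue. This is untested against Biopython.
    """
    out = []
    has_sq = False
    for line in embl_text.splitlines():
        if line.startswith("SQ"):
            has_sq = True
        if line.startswith("//"):
            if not has_sq:
                # Avoid doubling the spacer if one is already there.
                if not out or out[-1].strip() != "XX":
                    out.append("XX")
                out.append("SQ")
            has_sq = False  # reset for the next entry
        out.append(line)
    return "\n".join(out) + "\n"


# Invented annotation-only entry, for illustration.
original = "ID   hypothetical_contig_1\nXX\nFT   source          1..10\n//\n"
print(add_empty_sequence(original))
```

Entries that already contain an SQ line are passed through unchanged, so the
same script can be run over a mixed set of files.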

Peter