[Biopython] Problem parsing embl files
Peter Cock
p.j.a.cock at googlemail.com
Fri May 31 08:43:28 UTC 2013
On Thu, May 30, 2013 at 11:55 PM, Jaime Tovar <jmtc21 at bath.ac.uk> wrote:
> Hi Peter,
>
> I checked a similar version against the description of the EMBL format. It
> is a bit ambiguous, I think. From the definition we have:
>
> XX - spacer line (many per entry)
> SQ - sequence header (1 per entry)
> CO - contig/construct line (0 or >=1 per entry)
> bb - (blanks) sequence data (>=1 per entry)
> // - termination line (ends each entry; 1 per entry)
>
> At first I read SQ ... (1 per entry) and thought it meant there must be one
> of them. And similarly for the rest (many per entry, >=1). But for
> example from the same definition we have:
>
> DT - date (2 per entry)
>
> But my file does not have DT and the parser was not complaining about it,
> so it made me think maybe I was doing something wrong. To be honest I can't
> say I'm sure whether it means there must be an SQ line, or that it is
> optional but can only appear once per entry.
Well yes, it does seem that missing DT lines are also technically invalid -
but coping with that was quite simple. Missing a sequence in a
sequence-centric file format is rather more important ;)
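
To see which line types each entry in these files actually contains, a
quick tally of the two-letter line codes may help. This is only a rough
sketch in plain Python, not part of Biopython; the function name
count_line_codes and the "bb" label for sequence data lines are made up
for this illustration.

```python
from collections import Counter


def count_line_codes(lines):
    """Yield one Counter of two-letter line codes per EMBL entry.

    Hypothetical helper for illustration only, not a Biopython function.
    Lines starting with blanks are tallied as "bb" (sequence data).
    """
    counts = Counter()
    for line in lines:
        code = line[:2]
        if code == "//":
            # Termination line: this entry is complete.
            yield counts
            counts = Counter()
        elif code.strip():
            counts[code] += 1
        elif line.strip():
            # Non-empty line with a blank code, i.e. sequence data.
            counts["bb"] += 1
    if counts:
        # Final entry with no termination line.
        yield counts
```

Passing an open file handle to the function and printing each Counter
would show, per entry, whether DT, SQ or sequence data lines are absent.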
> The files are not mine. They are third-party files I got from another
> researcher, who in turn got them from someone else, so... They are
> annotations for algae contigs as far as I know. Not sure why they don't
> have the sequence part.
I would be interested to know how the files were prepared (e.g. which
tool produced them), but this isn't vital.
> To be honest I don't know if it is worth making changes to the parser. I
> can't say these files are actually well formatted. Maybe someone with more
> experience with embl files can give a second opinion.
Good idea - anyone?
> Do you think I could cheat the parser if I just 'sed' my embl files and
> replace the // with something like:
>
> """XX
> SQ
>
>
> //"""
Possibly - you'd need to do a little experimenting to find out the bare
minimum that would allow the parser to continue without code changes.
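
As a starting point for that experimenting, the same substitution can be
done in Python rather than sed. This is a sketch only: the function name
add_placeholder_sq is made up, the default placeholder lines are a guess
based on the block quoted above, and whether the parser accepts the
result has not been checked.

```python
def add_placeholder_sq(lines, placeholder=("XX\n", "SQ\n")):
    """Yield lines, adding placeholder lines before '//' when SQ is absent.

    Hypothetical helper for illustration only. The placeholder content is
    an assumption and may need adjusting before the parser accepts it.
    """
    seen_sq = False
    for line in lines:
        if line.startswith("SQ"):
            seen_sq = True
        elif line.startswith("//"):
            if not seen_sq:
                # Entry ended without a sequence header: add the stub.
                yield from placeholder
            seen_sq = False  # reset for the next entry
        yield line
```

Writing the output to a new file and then trying
SeqIO.parse(new_file, "embl") on it would show whether this placeholder
is enough. Entries that already have an SQ line are passed through
unchanged.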
Peter