[BioPython] Cannot parse/convert embl formatted files

Peter biopython at maubp.freeserve.co.uk
Sun Aug 13 22:32:53 UTC 2006


Martin MOKREJŠ wrote:
> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
> unassigned DNA, etc. I imagine those are some remnants from the EMBL data
> and such value never exist in original GenBank ... you're the judge here.

I've had a look at bug 2072, and for that example it looks like the
BioPerl converter tried to squeeze "genomic DNA" into what I thought
was a seven-character field (or eight if you allow it to steal the
following space). The extra characters seem to have pushed the later
fields ("linear", the division "FUN", and the date) out of position.

How is your Perl?  You could try:

(a) Editing the BioPerl conversion script so that it replaces sequence
types like "genomic DNA" or "unassigned DNA" with just "DNA"

Or,

(b) Editing the input EMBL file to make the same change in the ID line
at the start of each record.
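
If Perl is not an option, (b) could also be done with a few lines of
Python. This is only a rough sketch: it assumes the molecule type
appears verbatim in each line starting with "ID", and the function
names and the mapping table are made up for illustration. Extend the
table if your records use other long values.

```python
# Substitutions taken from the suggestion above; this table is an
# assumption about which long molecule types occur in your data.
MOLTYPE_FIXES = {
    "genomic DNA": "DNA",
    "unassigned DNA": "DNA",
}


def fix_id_line(line):
    """Shorten long molecule types, but only on EMBL ID lines."""
    if not line.startswith("ID "):
        # Leave every other line (DE, FT, sequence data, ...) untouched.
        return line
    for long_name, short_name in MOLTYPE_FIXES.items():
        line = line.replace(long_name, short_name)
    return line


def fix_embl(in_handle, out_handle):
    """Copy an EMBL file line by line, fixing the ID line of each record."""
    for line in in_handle:
        out_handle.write(fix_id_line(line))
```

You would call fix_embl() with the original file opened for reading
and a new file opened for writing, then run the BioPerl conversion on
the new file.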

Peter



