[BioPython] Cannot parse/convert embl formatted files

Sat Aug 12 21:35:08 UTC 2006

Peter wrote:
>>Can you download the same data in GenBank format from another source
>>like the NCBI instead?

Martin MOKREJŠ wrote:
> No, it contains some extra annotation provided by that Italian site.
> I managed to get it converted using bp_sreformat.pl to GenBank and
> made biopython GenBank parser to parse it with some minor problems.
>
>
> I do not know what is the general opinion but I observed errors with
> file-input. I understand it is better to fix the input file format
> but thought that maybe biopython could internally append the missing
> `"' character at the end of the line when a new feature is met on the
> next line:
>
> 5UTRef.Pln.dat
> Unbalanced quote in:
> /source="REFSEQ::XM_479174:1..213"
> /gene="B1056G08.147"
> /product="putative dihydropterin pyrophosphokinase
> No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0> line 815235.
>

And the relevant EMBL file was:
> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
> ...
> FT   5'UTR           1..213
> FT                   /source="REFSEQ::XM_479174:1..213"
> FT                   /gene="B1056G08.147"
> FT                   /product="putative dihydropterin pyrophosphokinase
> FT   repeat_region   61..87
> ...
> //
>
> I think the parser also has a problem with the continuation line ... but I am
> not sure now. Test it yourself if you want. ;-)

I've not used BioPerl, but it is complaining that the EMBL file you
are trying to convert has an unclosed quote for the product
annotation.

I would regard this EMBL file (and the GenBank equivalent) as "wrong"
but would hope that our GenBank parser could cope with this.  I have
not checked...
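
The workaround Martin suggests (closing the quote when the next feature
starts) could be done as a pre-processing step before either parser sees
the file. Below is a rough sketch in plain Python, not code from Biopython
or BioPerl. The function names are made up, and it assumes the usual EMBL
layout: a feature key starts in column 6, and qualifier and continuation
lines leave that column blank. It only counts quote characters per line,
so treat it as a starting point:

```python
def is_new_feature(line):
    """True if an EMBL line starts a new feature (key in column 6)."""
    return line.startswith("FT") and len(line) > 5 and line[5] != " "


def close_unbalanced_quotes(lines):
    """Return the lines with a closing quote appended wherever a
    qualifier value is still open when its feature ends.

    Hypothetical helper: assumes escaped quotes are doubled (""),
    so they do not change the odd/even count.
    """
    fixed = []
    quote_open = False
    for line in lines:
        line = line.rstrip("\n")
        ends_feature = is_new_feature(line) or not line.startswith("FT")
        if quote_open and ends_feature and fixed:
            # The previous line left a value open: close it there.
            fixed[-1] += '"'
            quote_open = False
        if line.startswith("FT") and not is_new_feature(line):
            # Qualifier or continuation line: an odd number of quote
            # characters toggles the open/closed state.
            if line.count('"') % 2 == 1:
                quote_open = not quote_open
        fixed.append(line)
    if quote_open and fixed:
        # Input ended while a value was still open.
        fixed[-1] += '"'
    return fixed


# The record fragment quoted above, with the elided lines left out.
SAMPLE = [
    "ID   5OSAR003520 standard; RNA; PLN; 213 BP.",
    "FT   5'UTR           1..213",
    'FT                   /source="REFSEQ::XM_479174:1..213"',
    'FT                   /gene="B1056G08.147"',
    'FT                   /product="putative dihydropterin pyrophosphokinase',
    "FT   repeat_region   61..87",
    "//",
]

for fixed_line in close_unbalanced_quotes(SAMPLE):
    print(fixed_line)
```

Only the /product line changes: it gains the missing closing quote. Multi-line
quoted values are left alone, because the quote is only forced closed when a
new feature key or a non-FT line turns up.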

> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
> unassigned DNA, etc. I imagine those are some remnants from the EMBL data
> and such value never exist in original GenBank ... you're the judge here.

Probably those variants never turn up in an "official" GenBank file.
In which case, cleaning up the locus line should be part of the EMBL
to GenBank conversion.

I would be interested to see a couple of your EMBL and converted
GenBank files.  Could you email me a few (small) examples directly
(NOT to the whole mailing list please, as I don't want to clog up
everyone's inboxes)?

> Last comment: it took me ages to figure out, given the sparse documentation, that
> cur_record.id is the ACCESSION and cur_record.annotations['accession'] is
> the LOCUS value. I still don't know how to get the DEFINITION value.

It sounds like you used the Bio.GenBank.FeatureParser to get a
Bio.SeqRecord object.  In this case the record id usually comes from
the VERSION line by default (and is normally the accession number with
a dot and a version number appended).  If this is missing, then the
first ACCESSION line is used.  As far as I can tell, any additional
ACCESSION lines are lost.
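
To make that fallback rule concrete, here is a small stand-alone sketch of
it in plain Python. It is NOT the Biopython implementation, just the rule
as I described it above. The function name pick_record_id and the
accession numbers in the example are made up for illustration:

```python
def pick_record_id(header_lines):
    """Pick a record id from GenBank header lines.

    Illustrative only: uses the first VERSION value if there is one,
    otherwise falls back to the first ACCESSION. All accessions are
    returned too, so nothing is silently dropped.
    """
    version = None
    accessions = []
    for line in header_lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "VERSION" and len(parts) > 1 and version is None:
            version = parts[1]
        elif parts[0] == "ACCESSION":
            accessions.extend(parts[1:])
    if version:
        return version, accessions
    return (accessions[0] if accessions else None), accessions


# Hypothetical header lines, not taken from a real record.
with_version = [
    "ACCESSION   AB000001 AB000002",
    "VERSION     AB000001.1",
]
without_version = ["ACCESSION   AB000001 AB000002"]

print(pick_record_id(with_version))
print(pick_record_id(without_version))
```

With a VERSION line the id carries the dot and version number. Without one,
the id is just the first accession.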

If you had used the Bio.GenBank.RecordParser to get a GenBank Record
object then it might have been a little easier.  The ACCESSION line(s)
should be in the list cur_record.accession.

In either case, I think the DEFINITION line in a GenBank file can be
accessed as cur_record.description (but I haven't tried that as my
dinner is getting cold).

Peter