[BioPython] Cannot parse/convert embl formatted files

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Sat Aug 12 21:49:20 UTC 2006


Hi Peter,


Peter wrote:
> Peter wrote:
> 
>>> Can you download the same data in GenBank format from another source
>>> like the NCBI instead?
> 
> 
> Martin MOKREJŠ wrote:
> 
>> No, it contains some extra annotation provided by that Italian site.
>> I managed to get it converted using bp_sreformat.pl to GenBank and
>> made biopython GenBank parser to parse it with some minor problems.
>>
>>
>> I do not know what is the general opinion but I observed errors with
>> file-input. I understand it is better to fix the input file format
>> but thought that maybe biopython could internally append the missing
>> `"' character at the end of the line when a new feature is met on the
>> next line:
>>
>> 5UTRef.Pln.dat
>> Unbalanced quote in:
>> /source="REFSEQ::XM_479174:1..213"
>> /gene="B1056G08.147"
>> /product="putative dihydropterin pyrophosphokinase
>> No further qualifiers will be added for this feature at
>> /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0>
>> line 815235.
>>
> 
> And the relevant EMBL file was:
> 
>> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
>> ...
>> FT   5'UTR           1..213
>> FT                   /source="REFSEQ::XM_479174:1..213"
>> FT                   /gene="B1056G08.147"
>> FT                   /product="putative dihydropterin pyrophosphokinase
>> FT   repeat_region   61..87
>> ...
>> //
>>
>> I think the parser also has a problem with the continuation line ... but
>> I am not sure now. Test it yourself if you want. ;-)
> 
> 
> I've not used BioPerl, but it is complaining that the EMBL file you
> are trying to convert has an unclosed quote for the product
> annotation.
> 
> I would regard this EMBL file (and the GenBank equivalent) as "wrong"
> but would hope that our GenBank parser could cope with this.  I have
> not checked...

Nice to hear that. Maybe it should emit a warning, so that one could also use
the output to verify generated files. Such a less strict mode should probably
be a configurable option of the parser.
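
To make the idea concrete, here is a rough sketch in plain Python of what such a
lenient pre-processing step could look like. It is not Biopython or BioPerl code;
the function name close_unbalanced_quotes is made up, and it assumes that a
missing quote should be closed only when the next line starts a new feature (or
leaves the feature table), and that embedded quotes are doubled so a simple
parity count is enough.

```python
import warnings


def close_unbalanced_quotes(lines):
    """Return EMBL lines with unclosed qualifier quotes repaired.

    Hypothetical helper: a quote left open in an FT qualifier is closed
    when the next line starts a new feature key or is not an FT line.
    """
    fixed = []
    open_quote = False  # True while we are inside a quoted qualifier value
    for line in lines:
        line = line.rstrip("\n")
        is_ft = line.startswith("FT")
        # A feature key starts in column 6; qualifier lines are blank there.
        starts_feature = is_ft and len(line) > 5 and line[5] != " "
        if open_quote and (not is_ft or starts_feature):
            warnings.warn(f"unbalanced quote closed after: {fixed[-1]!r}")
            fixed[-1] += '"'
            open_quote = False
        # An odd number of quote characters toggles the open/closed state.
        if is_ft and line.count('"') % 2 == 1:
            open_quote = not open_quote
        fixed.append(line)
    if open_quote:  # input ended while a quote was still open
        warnings.warn(f"unbalanced quote closed after: {fixed[-1]!r}")
        fixed[-1] += '"'
    return fixed


# The feature table lines quoted earlier in this thread.
sample = [
    "FT   5'UTR           1..213",
    'FT                   /source="REFSEQ::XM_479174:1..213"',
    'FT                   /gene="B1056G08.147"',
    'FT                   /product="putative dihydropterin pyrophosphokinase',
    "FT   repeat_region   61..87",
    "//",
]

for repaired in close_unbalanced_quotes(sample):
    print(repaired)
```

Using the warnings module means the caller can silence the messages or turn them
into errors, which would give the configurable strict/lenient behaviour.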

> 
>> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
>> unassigned DNA, etc. I imagine those are some remnants from the EMBL data
>> and such value never exist in original GenBank ... you're the judge here.
> 
> 
> Probably those variants never turn up in an "official" GenBank file.
> In which case, cleaning up the locus line should be part of the EMBL
> to GenBank conversion.

Sounds reasonable.

> 
> I would be interested to see a couple of your EMBL and converted
> GenBank files.  Could you email me a few (small) examples directly
> (NOT to the whole mailing list please, as I don't want to clog up
> everyone's inboxes)?

Will do after I re-create those broken resulting files. I had to edit
them manually.

> 
>> Last comment: it took me ages to figure out, with the sparse
>> documentation, that cur_record.id is the ACCESSION and
>> cur_record.annotations['accession'] is the LOCUS value. I still don't
>> know how to get the DEFINITION value.
> 
> 
> It sounds like you used the Bio.GenBank.FeatureParser to get a
> Bio.SeqRecord object.  In this case the record id usually comes from
> the VERSION line by default (and is normally the accession number with
> a dot and a version number appended).  If this is missing, then the
> first ACCESSION line is used.  As far as I can tell, any additional
> ACCESSION lines are lost.

I hadn't realized there are "two" parsers. ;) The above was my case.

> 
> If you had used the Bio.GenBank.RecordParser to get a GenBank Record
> object then it might have been a little easier.  The ACCESSION line(s)
> should be in the list cur_record.accession

Usually I do dir(some_stuff) to inspect the object. There was nothing
like that. ;-)

> 
> In either case, I think the DEFINITION line in a GenBank file can be
> accessed as cur_record.description (but I haven't tried that as my
> dinner is getting cold).

Usually I do dir(some_stuff) to inspect the object. There was nothing
like that. ;-)
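
For the archive, the lookup rule described above (id from the VERSION line,
falling back to the first ACCESSION; description from DEFINITION) can be
sketched in plain Python. This is not Biopython's code: the function name
read_genbank_header, the dictionary keys and the sample header are all invented
for illustration, and it assumes the usual layout with the keyword in the first
12 columns and continuation lines indented by 12 spaces.

```python
def read_genbank_header(text):
    """Collect a few header fields from one GenBank record (sketch only)."""
    locus, version = None, None
    accessions, definition = [], []
    keyword = None
    for line in text.splitlines():
        key, value = line[:12].strip(), line[12:].strip()
        if key in ("FEATURES", "ORIGIN"):
            break  # the header ends where the feature table or sequence starts
        if key:
            keyword = key  # a blank key column means a continuation line
        if not value:
            continue
        if keyword == "LOCUS":
            locus = value.split()[0]
        elif keyword == "DEFINITION":
            definition.append(value)
        elif keyword == "ACCESSION":
            accessions.extend(value.split())
        elif keyword == "VERSION" and version is None:
            version = value.split()[0]
    # Prefer the versioned accession, else fall back to the first ACCESSION.
    record_id = version or (accessions[0] if accessions else None)
    return {
        "id": record_id,
        "locus": locus,
        "accessions": accessions,
        "description": " ".join(definition),
    }


# Invented header, for illustration only (not a real database record).
SAMPLE = """\
LOCUS       EXAMPLE01                213 bp    mRNA    linear   PLN 12-AUG-2006
DEFINITION  Hypothetical example record, first line of the definition
            continued on a second line.
ACCESSION   X00001 X00002
VERSION     X00001.1
FEATURES             Location/Qualifiers
"""

info = read_genbank_header(SAMPLE)
print(info["id"], info["accessions"], info["description"], sep="\n")
```

Unlike the behaviour described above, this sketch keeps every ACCESSION entry in
a list, so the secondary accessions are not lost.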

Actually, I am in the same TZ. ;)

Thanks for answers.
Martin


