[BioPython] Cannot parse/convert embl formatted files

Sat Aug 12 17:14:01 UTC 2006

Hi Peter,

Peter wrote:
> I'm not very familiar with the FormatIO system, so I'm not sure what
> to suggest there.
> 
>>In principle, I do need to convert the file, what I really need is
---------------------^ not need ...

> 
>> a parser from EMBL formatted data from
>> ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
>> to parse out record with some feature. As I do not see an EMBL
>> parser in the Bio package I believe it is not available, right?
> 
> 
> You are right, there is currently no BioPython EMBL parser included in
> BioPython (other than whatever FormatIO can be persuaded to do on a
> good day).  However, it is something that the developers would like to
> address (there has been some recent discussion on the mailing list
> about sequence input/output in general).
> 
> Can you download the same data in GenBank format from another source
> like the NCBI instead?

No, it contains some extra annotation provided by that Italian site.
I managed to convert it to GenBank using bp_sreformat.pl and got the
biopython GenBank parser to parse it, with some minor problems.

I do not know what the general opinion is, but I observed errors with the
input files. I understand it is better to fix the input file format,
but I thought that maybe biopython could internally append the missing
`"' character at the end of the line when a new feature is met on the
next line:

5UTRef.Pln.dat
Unbalanced quote in:
/source="REFSEQ::XM_479174:1..213"
/gene="B1056G08.147"
/product="putative dihydropterin pyrophosphokinase
No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0> line 815235.

ID   5OSAR003520 standard; RNA; PLN; 213 BP.
XX
AC   BR184455;
XX
DT   01-OCT-2004 (Rel. 4, Created)
DT   01-OCT-2004 (Rel. 4, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group), mRNA.
XX
DR   REFSEQ; XM_479174;
DR   UTRef; CR191654;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP
OC   clade; Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR;
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..213
FT                   /source="REFSEQ::XM_479174:1..213"
FT                   /gene="B1056G08.147"
FT                   /product="putative dihydropterin pyrophosphokinase
FT   repeat_region   61..87
FT                   /source="REFSEQ::XM_479174:61..87"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 213 BP; 27 A; 85 C; 54 G; 47 T; 0 other;
     ttcgcggatt accaaatcct atttcccgtc cactcggcgt cggctcctcg tgagttcttt        60
     cgccggccgc cgccgccgcc cgcgccgatc cccatccatc ccgcaagcgc gcgcgcgagc       120
     aggggccgca catcgcgttc gttccgctgc ttccgccgca tcctgggcgc tgcaatttcg       180
     gttcagaatt ctccgcctca catatgcttg acg                                    213
//
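
To make the idea concrete, here is a minimal sketch of such a repair pass,
written as a standalone pre-filter rather than as part of any parser. The
function name and both regular expressions are hypothetical, and the sketch
assumes that a qualifier line is "FT", whitespace, then "/name", and that a
feature key starts three spaces after "FT", as in the record above.

```python
import re

# Assumed layout: a qualifier line is "FT", whitespace, then "/name=" (or a bare "/name").
QUALIFIER_START = re.compile(r'^FT\s+/\w+(?:=|\s*$)')
# Assumed layout: a feature key starts exactly three spaces after "FT".
FEATURE_START = re.compile(r'^FT {3}\S')


def close_unbalanced_quotes(lines):
    """Yield EMBL lines, appending a closing quote to FT qualifier values
    that are still open when the next qualifier, feature or section starts."""
    prev = None        # previous line, held back until we know it is complete
    in_quote = False   # True while a quoted qualifier value is still open
    for raw in lines:
        line = raw.rstrip("\n")
        is_qualifier = bool(QUALIFIER_START.match(line))
        new_item = (is_qualifier
                    or bool(FEATURE_START.match(line))
                    or not line.startswith("FT"))
        if in_quote and new_item:
            prev += '"'            # the value was never closed: repair it
            in_quote = False
        if prev is not None:
            yield prev
        odd = line.count('"') % 2 == 1
        if is_qualifier:
            in_quote = odd         # an odd number of quotes leaves the value open
        elif in_quote and odd:
            in_quote = False       # a continuation line closed the value
        prev = line
    if prev is not None:
        yield prev + '"' if in_quote else prev
```

Properly quoted multi-line values pass through unchanged, because a value
only counts as open while the number of quote characters seen so far is odd.
A continuation line that itself begins with "/name=" would be misread as a
new qualifier, so this is a heuristic, not a validator.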

I think the parser also has a problem with the continuation lines ... but I am
not sure now. Test it yourself if you want. ;-)

ID   5OSA010809 standard; genomic DNA; PLN; 191 BP.
XX
AC   BB302881;
XX
DT   03-JAN-2005 (Rel. 20, Created)
DT   03-JAN-2005 (Rel. 20, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 7,
DE   PAC clone:P0552F09.
XX
DR   EMBL; AP004308;
DR   UTR; CC338570;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP clade;
OC   Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR; Complete; 2 exon(s)
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..191
FT                   /source="join(EMBL::AP004308:94626..94801,
FT                   EMBL::AP004308:95084..95098)"
FT                   /gene="P0552F09.130-2"
FT                   /product="putative
FT                   2-amino-4-hydroxy-6-hydroxymethyldihydropteridine
FT                   diphosphokinase"
FT   repeat_region   72..98
FT                   /source="EMBL::AP004308:94697..94723"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 191 BP; 25 A; 78 C; 51 G; 37 T; 0 other;
     gcagcttcgc cttcgcggat taccaaatcc tatttcccgt ccactcggcg tcggctcctc        60
     gtgagttctt tcgccggccg ccgccgccgc ccgcgccgat ccccatccat cccgcaagcg       120
     cgcgcgcgag caggggccgc acatcgcgtt cgttccgctg cttccgccgc atcctggaga       180
     cattcaggaa g                                                            191
//
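
For testing the continuation lines independently of any parser, a small
sketch like the following can show what the merged qualifier values should
look like. The function name is hypothetical, and joining the pieces with a
single space is an assumption on my side (it also puts a space into the
wrapped join(...) location above).

```python
import re

# Assumed layout: qualifier lines look like "FT", whitespace, "/name=value".
QUALIFIER = re.compile(r'^FT\s+/(\w+)=?(.*)$')
# Assumed layout: a feature key starts exactly three spaces after "FT".
FEATURE_START = re.compile(r'^FT {3}\S')


def unquote(value):
    """Remove one pair of surrounding double quotes, if present."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value


def join_qualifiers(ft_lines):
    """Return (name, value) pairs with continuation lines merged."""
    pairs = []
    current = None  # qualifier currently being extended, if any
    for line in ft_lines:
        line = line.rstrip()
        match = QUALIFIER.match(line)
        if match:
            current = [match.group(1), match.group(2)]
            pairs.append(current)
        elif FEATURE_START.match(line) or not line.startswith("FT"):
            current = None  # feature key or other section: nothing to extend
        elif current is not None:
            # Continuation line: drop the "FT" prefix and the indentation.
            current[1] += " " + line[2:].strip()
    return [(name, unquote(value)) for name, value in pairs]
```

Fed the FT lines of the record above, this should give one /product value
reading "putative 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine
diphosphokinase", which is a handy reference when checking what a real
parser returns.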

Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
unassigned DNA, etc. I imagine those are remnants of the EMBL data and
that such values never occur in original GenBank files ... you're the judge
here. Here is what I did:

for f in 5UTR*.dat.gz; do
    echo "$f"; n=$(basename "$f" .dat.gz)
    gzip -dc "$f" |
    sed -e 's/""$/"/' \
        -e 's/genomic DNA/DNA /' \
        -e 's/unassigned DNA/DNA /' \
        -e 's/genomic RNA/RNA /' \
        -e 's/unassigned RNA/RNA /' \
        -e 's/other RNA/RNA /' \
        -e 's/pre-RNA linear/RNA linear/' \
        -e 's/circularcircular/RNA circular/' |
    bp_sreformat.pl -if embl -of genbank -i - -o "$n.gb"
done
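
The same clean-up could be done from Python before handing the data to the
converter. This is only a sketch mirroring the sed expressions above: the
names SUBSTITUTIONS and normalize_line are hypothetical, and like sed without
the g flag it replaces only the first match on each line.

```python
import re

# Same substitutions, in the same order, as the sed expressions above.
SUBSTITUTIONS = [
    (re.compile(r'""$'), '"'),
    (re.compile(r'genomic DNA'), 'DNA '),
    (re.compile(r'unassigned DNA'), 'DNA '),
    (re.compile(r'genomic RNA'), 'RNA '),
    (re.compile(r'unassigned RNA'), 'RNA '),
    (re.compile(r'other RNA'), 'RNA '),
    (re.compile(r'pre-RNA linear'), 'RNA linear'),
    (re.compile(r'circularcircular'), 'RNA circular'),
]


def normalize_line(line):
    """Apply every substitution once to a single line (newline removed)."""
    line = line.rstrip("\n")
    for pattern, replacement in SUBSTITUTIONS:
        line = pattern.sub(replacement, line, count=1)
    return line


def normalize_lines(lines):
    """Yield cleaned lines from any iterable of text lines."""
    for line in lines:
        yield normalize_line(line)
```

A compressed file can be fed in with gzip.open(path, "rt") from the standard
library; the conversion itself would still be left to bp_sreformat.pl.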

Last comment: given the sparse documentation, it took me ages to figure out
that cur_record.id is the ACCESSION and cur_record.annotations['accession'] is
the LOCUS value. I still don't know how to get the DEFINITION value.
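
As a stopgap that does not depend on the parser's attribute names, the header
fields can be read straight from the GenBank flat file. This is a minimal
sketch, assuming the usual layout with the keyword in the first 12 columns
and continuation lines indented underneath; read_header is a hypothetical
helper, not part of biopython.

```python
def read_header(lines):
    """Collect header fields of one GenBank record into a dict.

    Assumes keywords occupy the first 12 columns and that continuation
    lines leave those columns blank. Repeated keywords (e.g. REFERENCE)
    overwrite each other in this sketch.
    """
    fields = {}
    key = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break  # end of the header section
        keyword = line[:12].strip()
        if keyword:
            key = keyword
            fields[key] = line[12:].strip()
        elif key is not None and line.strip():
            # Continuation of the previous field, e.g. a wrapped DEFINITION.
            fields[key] += " " + line.strip()
    return fields
```

With this, read_header(open(path))["DEFINITION"] gives the full definition
text and the first word of the "LOCUS" entry is the locus name. In
biopython's SeqRecord objects the DEFINITION text is usually exposed as the
description attribute, but that is worth checking against the installed
version.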

I am probably desperate.
Martin