[BioPython] Cannot parse/convert embl formatted files
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Sat Aug 12 17:14:01 UTC 2006
Hi Peter,
Peter wrote:
> I'm not very familiar with the FormatIO system, so I'm not sure what
> to suggest there.
>
>> In principle, I do not need to convert the file, what I really need is
>> a parser from EMBL formatted data from
>> ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
>> to parse out record with some feature. As I do not see an EMBL
>> parser in the Bio package I believe it is not available, right?
>
>
> You are right, there is currently no BioPython EMBL parser included in
> BioPython (other than whatever FormatIO can be persuaded to do on a
> good day). However, it is something that the developers would like to
> address (there has been some recent discussion on the mailing list
> about sequence input/output in general).
>
> Can you download the same data in GenBank format from another source
> like the NCBI instead?
No, it contains some extra annotation provided by that Italian site.
I managed to convert it to GenBank format using bp_sreformat.pl and
got the Biopython GenBank parser to parse it, with some minor problems.
I do not know what the general opinion is, but I observed errors with
the input file. I understand it is better to fix the input file itself,
but I thought that maybe Biopython could internally append the missing
`"' character at the end of the line when a new feature is met on the
next line:
5UTRef.Pln.dat
Unbalanced quote in:
/source="REFSEQ::XM_479174:1..213"
/gene="B1056G08.147"
/product="putative dihydropterin pyrophosphokinase
No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0> line 815235.
ID 5OSAR003520 standard; RNA; PLN; 213 BP.
XX
AC BR184455;
XX
DT 01-OCT-2004 (Rel. 4, Created)
DT 01-OCT-2004 (Rel. 4, Last updated, Version 1)
XX
DE 5'UTR in Oryza sativa (japonica cultivar-group), mRNA.
XX
DR REFSEQ; XM_479174;
DR UTRef; CR191654;
XX
OS Oryza sativa (japonica cultivar-group)
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP
OC clade; Ehrhartoideae; Oryzeae; Oryza.
XX
UT 5'UTR;
XX
FH Key Location/Qualifiers
FH
FT 5'UTR 1..213
FT /source="REFSEQ::XM_479174:1..213"
FT /gene="B1056G08.147"
FT /product="putative dihydropterin pyrophosphokinase
FT repeat_region 61..87
FT /source="REFSEQ::XM_479174:61..87"
FT /evidence="Pattern Similarity"
FT /repeat_type="GC_rich"
FT /repeat_family="Low_complexity"
XX
SQ Sequence 213 BP; 27 A; 85 C; 54 G; 47 T; 0 other;
ttcgcggatt accaaatcct atttcccgtc cactcggcgt cggctcctcg tgagttcttt 60
cgccggccgc cgccgccgcc cgcgccgatc cccatccatc ccgcaagcgc gcgcgcgagc 120
aggggccgca catcgcgttc gttccgctgc ttccgccgca tcctgggcgc tgcaatttcg 180
gttcagaatt ctccgcctca catatgcttg acg 213
//
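A minimal sketch of the repair suggested above, written as a standalone pre-filter rather than as a change to any existing parser. It assumes the standard EMBL feature-table columns (feature key in column 6, qualifiers in column 22); the function name `balance_quotes` is made up for this illustration, and a continuation line that happens to begin with `/` would be misread as a new qualifier.

```python
import re

# Assumption: standard EMBL feature-table columns, i.e. a feature key
# starts in column 6 and a qualifier starts in column 22.
NEW_FEATURE = re.compile(r"^FT {3}\S")
NEW_QUALIFIER = re.compile(r"^FT {19}/\w+")


def balance_quotes(lines):
    """Yield lines, appending '"' where a qualifier value was left open."""
    previous = None   # line held back until we know whether it needs a quote
    in_quote = False  # True while a qualifier value is still unterminated
    for line in lines:
        line = line.rstrip("\n")
        is_feature_line = line.startswith("FT")
        starts_new_unit = (
            not is_feature_line
            or NEW_FEATURE.match(line) is not None
            or NEW_QUALIFIER.match(line) is not None
        )
        if in_quote and starts_new_unit:
            # A new feature/qualifier began while a value was still open.
            previous += '"'
            in_quote = False
        if previous is not None:
            yield previous
        # An odd number of quotes opens or closes a value; the escaped
        # form "" inside a value counts twice and leaves the state alone.
        if is_feature_line and line.count('"') % 2 == 1:
            in_quote = not in_quote
        previous = line
    if previous is not None:
        yield (previous + '"') if in_quote else previous


# Usage on a few lines modelled on the record above.
pad = "FT" + " " * 19
record = [
    "FT   5'UTR           1..213",
    pad + '/gene="B1056G08.147"',
    pad + '/product="putative dihydropterin pyrophosphokinase',
    "FT   repeat_region   61..87",
    pad + '/repeat_type="GC_rich"',
    "XX",
]
fixed = list(balance_quotes(record))
```

Only the unterminated /product line is changed; properly quoted multi-line values pass through untouched.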
I think the parser also has a problem with continuation lines, but I am
not sure now. Test it yourself if you want. ;-)
ID 5OSA010809 standard; genomic DNA; PLN; 191 BP.
XX
AC BB302881;
XX
DT 03-JAN-2005 (Rel. 20, Created)
DT 03-JAN-2005 (Rel. 20, Last updated, Version 1)
XX
DE 5'UTR in Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 7,
DE PAC clone:P0552F09.
XX
DR EMBL; AP004308;
DR UTR; CC338570;
XX
OS Oryza sativa (japonica cultivar-group)
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP clade;
OC Ehrhartoideae; Oryzeae; Oryza.
XX
UT 5'UTR; Complete; 2 exon(s)
XX
FH Key Location/Qualifiers
FH
FT 5'UTR 1..191
FT /source="join(EMBL::AP004308:94626..94801,
FT EMBL::AP004308:95084..95098)"
FT /gene="P0552F09.130-2"
FT /product="putative
FT 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine
FT diphosphokinase"
FT repeat_region 72..98
FT /source="EMBL::AP004308:94697..94723"
FT /evidence="Pattern Similarity"
FT /repeat_type="GC_rich"
FT /repeat_family="Low_complexity"
XX
SQ Sequence 191 BP; 25 A; 78 C; 51 G; 37 T; 0 other;
gcagcttcgc cttcgcggat taccaaatcc tatttcccgt ccactcggcg tcggctcctc 60
gtgagttctt tcgccggccg ccgccgccgc ccgcgccgat ccccatccat cccgcaagcg 120
cgcgcgcgag caggggccgc acatcgcgtt cgttccgctg cttccgccgc atcctggaga 180
cattcaggaa g 191
//
Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
unassigned DNA, etc. I imagine those are remnants from the EMBL data
and such values never exist in original GenBank files ... you're the judge here.
Here is what I did:
for f in 5UTR*.dat.gz; do echo "$f"; n=$(basename "$f" .dat.gz); gzip -dc "$f" | \
sed -e 's/""$/"/' | sed -e "s/genomic DNA/DNA /" | \
sed -e 's/unassigned DNA/DNA /' | sed -e "s/genomic RNA/RNA /" | \
sed -e 's/unassigned RNA/RNA /' | sed -e "s/other RNA/RNA /" | \
sed -e "s/pre-RNA linear/RNA linear/" | \
sed -e "s/circularcircular/RNA circular/" | \
bp_sreformat.pl -if embl -of genbank -i - -o "$n.gb"; done
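One caveat with the pipeline above is that sed rewrites every line, so a DE line that mentions "genomic DNA" is altered too. A hedged sketch of the same molecule-type cleanup in Python, restricted to ID lines, might look like this. The mapping mirrors the substitutions above only loosely (no column padding, no `circularcircular` rule, and pre-RNA is mapped without the "linear" suffix); the names `MOLECULE_TYPES` and `normalise_id_line` are invented for the example.

```python
# Assumption: old-style EMBL ID lines of the form
#   ID   <name> <class>; <molecule>; <division>; <length> BP.
MOLECULE_TYPES = {
    "genomic DNA": "DNA",
    "unassigned DNA": "DNA",
    "genomic RNA": "RNA",
    "unassigned RNA": "RNA",
    "other RNA": "RNA",
    "pre-RNA": "RNA",
}


def normalise_id_line(line):
    """Simplify the molecule type on an ID line; return other lines as-is."""
    if not line.startswith("ID "):
        return line
    fields = line.split(";")
    if len(fields) < 2:
        return line  # not the layout we assumed, leave it alone
    molecule = fields[1].strip()
    fields[1] = " " + MOLECULE_TYPES.get(molecule, molecule)
    return ";".join(fields)


# Usage on two lines taken from the second record quoted above.
id_line = "ID   5OSA010809 standard; genomic DNA; PLN; 191 BP."
de_line = "DE   5'UTR in Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 7,"
cleaned_id = normalise_id_line(id_line)
cleaned_de = normalise_id_line(de_line)
```

The output of such a filter could still be piped into bp_sreformat.pl exactly as before.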
Last comment: with the sparse documentation, it took me ages to figure out
that cur_record.id is the ACCESSION and cur_record.annotations['accession']
is the LOCUS value. I still don't know how to get the DEFINITION value.
I am probably desperate.
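On the DEFINITION question: as far as I know, Biopython's parsed record also exposes the DEFINITION text as its `description` attribute, though that should be checked against the installed version. As a parser-independent fallback, here is a small sketch that reads the three header fields straight from the GenBank text. It assumes the usual layout (keyword in columns 1-12, value from column 13); the function name and the sample record are made up for illustration, the sample being assembled by hand from the second EMBL entry above rather than taken from real converter output.

```python
def genbank_header_fields(text):
    """Return LOCUS name, ACCESSION and DEFINITION of the first record."""
    fields = {}
    current = None  # keyword whose value the next continuation line extends
    for line in text.splitlines():
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break  # header is over
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword:
            current = keyword
            if keyword == "LOCUS":
                # Only the first token is the locus name.
                fields[keyword] = value.split()[0] if value else ""
            elif keyword in ("ACCESSION", "DEFINITION"):
                fields[keyword] = value
        elif current == "DEFINITION" and value:
            # DEFINITION may wrap over several lines.
            fields["DEFINITION"] += " " + value
    return fields


# Hand-built sample (hypothetical, not actual bp_sreformat.pl output).
sample = "\n".join([
    "LOCUS".ljust(12) + "5OSA010809 191 bp DNA linear PLN 03-JAN-2005",
    "DEFINITION".ljust(12) + "5'UTR in Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 7,",
    " " * 12 + "PAC clone:P0552F09.",
    "ACCESSION".ljust(12) + "BB302881",
    "FEATURES".ljust(21) + "Location/Qualifiers",
])
header = genbank_header_fields(sample)
```

Here `header["DEFINITION"]` holds the joined two-line description, `header["LOCUS"]` the entry name and `header["ACCESSION"]` the accession.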
Martin