[Bioperl-l] SeqIO problems when reading Ensembl files

simon andrews (BI) simon.andrews@bbsrc.ac.uk
Thu, 31 Jan 2002 10:21:59 -0000


SeqIO/embl has problems reading the EMBL files generated by Ensembl.
The moltype and division aren't being parsed out of the ID line correctly.
Reading an Ensembl sequence in and writing it back out through SeqIO changes
the headers from this:

ID   Chromosome 11 -74161 to 925839  ENSEMBL; DNA; HTG; 1000001 BP.
XX
AC   Chromosome 11 -74161 to 925839;
XX
SV   NO_SV_NUMBER
XX
DE   Reannotated sequence via Ensembl
XX
KW   HTG; HTGS_PHASE.

To this:

ID   unknown id standard; XXX; UNK; 1000001 BP.
XX
AC   Chromosome;11;-74161;to;925839;
XX
DE   Reannotated sequence via Ensembl
XX
KW   HTG; HTGS_PHASE.
XX

With warnings enabled, I get the following message when writing out this
sequence:

Use of uninitialized value in pattern match (m//) at
/usr/lib/perl5/site_perl/5.6.0/Bio/SeqIO/embl.pm line 352, <GEN0> line
20145.

This comes from the moltype being undefined.
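
As a rough illustration of that chain of events, here is a minimal sketch.
It assumes the ID line is parsed with the current pattern in list context and
that the writer later runs a pattern match against the moltype; the variable
names and the /DNA/ test are stand-ins of mine, not the actual code in
embl.pm.

```perl
use strict;
use warnings;

my $line = 'ID   Chromosome 11 -74161 to 925839  ENSEMBL; DNA; HTG; 1000001 BP.';

# A failed match in list context returns an empty list, so all three
# variables are left undefined.
my ($name, $mol, $div) = $line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;

my @warnings;
{
    # Collect warnings instead of printing them, so they can be inspected.
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    # Hypothetical stand-in for the check the writer does on the moltype.
    my $is_dna = ($mol =~ /DNA/) ? 1 : 0;
}

print defined $mol ? "moltype: $mol\n" : "moltype is undefined\n";
print "warning: $_" for @warnings;
```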

I can fix the parser by changing line 153 in embl.pm from 

	$line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;

to 

	$line =~ /^ID\s+(\S+).+\;\s+([^;]+)\;\s+(\S+)\;/;

I've looked through a few non-Ensembl EMBL files, and I don't think this
breaks anything else, but other people may have come across unusual ID lines
that it doesn't handle.
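
For anyone who wants to check the two patterns side by side, a small
standalone sketch follows. The two regexes and the Ensembl ID line are copied
from above; parse_id is a hypothetical helper of mine (not something in
Bio::SeqIO::embl), and the TRBG361 line is just an illustrative ID line in
the conventional layout, assumed here for comparison.

```perl
use strict;
use warnings;

# Current and proposed ID-line patterns, as quoted in this message.
my $old_re = qr/^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;
my $new_re = qr/^ID\s+(\S+).+\;\s+([^;]+)\;\s+(\S+)\;/;

# The Ensembl line is from this message; the second line is an assumed
# example of a conventional ID line.
my $ensembl  = 'ID   Chromosome 11 -74161 to 925839  ENSEMBL; DNA; HTG; 1000001 BP.';
my $standard = 'ID   TRBG361    standard; mRNA; PLN; 1859 BP.';

# Hypothetical helper: returns (name, moltype, division), or an empty
# list if the pattern does not match.
sub parse_id {
    my ($re, $line) = @_;
    my @fields = $line =~ $re;
    return @fields;
}

for my $line ($ensembl, $standard) {
    print "$line\n";
    for my $pair (['old', $old_re], ['new', $new_re]) {
        my ($label, $re) = @$pair;
        my @fields = parse_id($re, $line);
        print "  $label: ", (@fields ? join(' | ', @fields) : 'no match'), "\n";
    }
}
```

Note that with the proposed pattern the name captured from the Ensembl line
is only the first word, "Chromosome", because (\S+) stops at the first space;
the moltype and division come out as DNA and HTG.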

This is using BioPerl 0.9.3.


Simon.

----
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute

simon.andrews@bbsrc.ac.uk
+44 (0)1223 496463