[Bioperl-l] embl format

Tue Jul 22 15:46:51 EDT 2003

I am just learning about parsing EMBL format with BioPerl.  I have this
line:

ID   GNA1_DROM     STANDARD;      PRT;   219 AA;

Bio::SeqIO::embl tries to parse it using this pattern match:

$line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;

This fails because the last chunk of the regular expression,
\s+(\S+)\; expects whitespace, then a run of non-whitespace, then a
semicolon.  In the file, the whitespace is followed by 219, a space, AA,
and only then the semicolon, so the match fails.  I've also found a
website that claimed an EMBL ID line looks like this:

ID   AA03518    standard; DNA; FUN; 237 BP.

As far as I can tell this one does match, but only because (\S+) picks
up FUN; the length at the end is never looked at.  Are these
non-standard formats that I should change to make work?  Otherwise, I
wonder if the regexp should be something more like:

$line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\s*\S*[\.\;]/;

This way, (\S+) captures the number in the first example, \s* matches
the space if it's there, \S* matches the BP or AA if it's there, and the
last bit handles a semicolon or a period.
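
To check the two patterns against the two ID lines quoted above, here is a
minimal standalone sketch.  It does not use BioPerl at all; it only copies
the two regular expressions from this message, and try_pattern is a
hypothetical helper name made up for the test.

```perl
use strict;
use warnings;

# The two ID lines quoted in this message.
my @lines = (
    'ID   GNA1_DROM     STANDARD;      PRT;   219 AA;',
    'ID   AA03518    standard; DNA; FUN; 237 BP.',
);

# Pattern quoted from Bio::SeqIO::embl, and the relaxed one proposed above.
my $current  = qr/^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;
my $proposed = qr/^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\s*\S*[\.\;]/;

# Hypothetical helper: returns the three captures on a match,
# or an empty list when the pattern does not match.
sub try_pattern {
    my ($re, $line) = @_;
    my @caps = $line =~ $re;
    return @caps;
}

for my $line (@lines) {
    print "$line\n";
    for my $pair (['current', $current], ['proposed', $proposed]) {
        my ($label, $re) = @$pair;
        my @caps = try_pattern($re, $line);
        printf "  %-8s %s\n", $label, @caps ? join(' | ', @caps) : 'no match';
    }
}
```

By my reading, only the first line is rejected by the current pattern.  The
second line matches both patterns, with the third capture being FUN rather
than the length, so the relaxed pattern does not return the same field for
both lines.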

Also, for that first example, the parser uses the field that contains
PRT to guess the alphabet, but PRT doesn't appear to be one of the values
it handles when it sets the alphabet.  Is this another non-standard thing?

Thanks,
Mike
-- 
---------------------------
Mike Olson
St. Olaf Mycobacterium Lab
St. Olaf College
Northfield, MN
507-646-3102
---------------------------