[Bioperl-l] SeqIO embl parser bug?

Sam Griffiths-Jones sgj@sanger.ac.uk
Thu, 17 Oct 2002 17:23:43 +0100 (BST)


Eeek -- just been bitten badly by this one.

<confession> We in Team Pfam are stuck with an old version of bioperl
for legacy reasons (not sure why, but this must be Ewan's fault :)
</confession>, but after a quick cvs update it seems that bioperl-live
still has the same behaviour. Apologies if I'm wrong and this has been
fixed.

Anyway -- the SeqIO embl parser does:

       #accession number
       if( /^AC\s+(.*)?/ ) {
           my @accs = split(/[; ]+/, $1); # allow space in addition
           $params{'-accession_number'} = shift @accs;
           $params{'-secondary_accessions'} = \@accs;
       }

This gets it wrong when there's more than one AC line, because each AC
line overwrites the values set by the previous one -- e.g.:

ID   ECAPAH02   standard; DNA; PRO; 111408 BP.
XX
AC   D10483; J01597; J01683; J01706; K01298; K01990; M10420; M10611; M12544;
AC   V00259; X04711; X54847; X54945; X55034; X56742;
XX
SV   D10483.2
..

The primary accession gets called as V00259, with 5 secondary
accessions, when it should be D10483 with 14.  This is particularly
nasty in this case, as there's another EMBL entry with primary id
V00259 and a different sequence ... :(
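
For what it's worth, here is a minimal sketch of one way around it:
collect the accessions from every AC line first, and only then take
the first one as primary. This is a standalone illustration, not a
patch against the parser; parse_accessions is a hypothetical helper
name that does not exist in bioperl, and the entry is the trimmed
header from the example above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): gather accessions from all
# AC lines of one entry, so a second AC line appends instead of
# overwriting what the first one set.
sub parse_accessions {
    my @lines = @_;
    my @accs;
    for my $line (@lines) {
        last if $line =~ m{^//};             # end-of-entry marker
        next unless $line =~ /^AC\s+(.*)/;   # only AC lines matter here
        # same split as the current parser; drop any empty fields
        push @accs, grep { length } split /[; ]+/, $1;
    }
    my $primary = shift @accs;   # first accession on the first AC line
    return ($primary, \@accs);   # the rest are secondary accessions
}

my @entry = (
    'ID   ECAPAH02   standard; DNA; PRO; 111408 BP.',
    'XX',
    'AC   D10483; J01597; J01683; J01706; K01298; K01990; M10420; M10611; M12544;',
    'AC   V00259; X04711; X54847; X54945; X55034; X56742;',
    'XX',
    'SV   D10483.2',
);

my ($primary, $secondary) = parse_accessions(@entry);
print "primary: $primary\n";                      # primary: D10483
print "secondary: ", scalar(@$secondary), "\n";   # secondary: 14
```

Inside the real parser the equivalent change would presumably be to
push onto an array that lives outside the per-line match and to fill
in -accession_number and -secondary_accessions once the header has
been read, but I haven't tested that against bioperl-live.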

Sam

--------------------------------------------------------------------
Sam Griffiths-Jones                              sgj@sanger.ac.uk
http://www.sanger.ac.uk/Users/sgj                +44 (0)1223 834244

Wisdom #4885:  It's always darkest before dawn, so if you're going
to steal your neighbour's newspaper, that's the time to do it.
--------------------------------------------------------------------