[Bioperl-l] Parsing EMBL DR lines with 1 accession

Jason Stajich jason at cgt.duhs.duke.edu
Tue Mar 16 11:41:22 EST 2004


James, please submit this as a bug at http://bugzilla.open-bio.org so we
can keep track of when it gets fixed.

-jason
On Tue, 16 Mar 2004, James Abbott wrote:

> Greetings bioperlers...
>
> I have been using Bio::SeqIO to parse EMBL files, and noticed that some
> of the database cross-references (DR lines) were missing from the
> returned RichSeq object. The missing references were to the GOA
> database, which have only a primary id: the secondary id/accession
> usually found in (for example) swissprot/trembl references is absent,
> e.g. (from EMBL:AE000562)
>
> DR   GOA; O25226.
> DR   GOA; P96551.
> DR   SPTREMBL; O25217; O25217.
> DR   SPTREMBL; O25218; O25218.
>
> These SPTREMBL cross-references are parsed fine; however, the GOA
> references are skipped. Looking at the code in question in
> Bio::SeqIO::embl, although there is provision for dbxrefs with a single
> id, the regex requires a trailing ';' after the primary accession. I
> have included a diff against embl.pm v 1.72 (see below...) which alters
> this regex to optionally match the second ';', allowing the id in DR
> lines with a single accession to be parsed as the primary accession.
>
> Writing these entries back out again using Bio::SeqIO::embl results in
> these DR lines appearing as:
>
> DR   GOA; O25226; .
>
> However, the examples given in the EMBL user manual, and all those I've
> found in EMBL, lack this second ';' and the whitespace that follows it.
> The second modification in the diff changes the behaviour of
> Bio::SeqIO::embl when there is no secondary accession present, ensuring
> that the DR line is written out as
>
> DR   GOA; O25226.
>
> Cheers,
> James
>
>
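
For illustration, the parsing and writing behaviour James describes could be
sketched roughly as follows. This is a minimal standalone Perl sketch, not the
actual code or diff for Bio::SeqIO::embl; the helper names parse_dr_line and
format_dr_line and the hash keys are hypothetical and chosen only for this
example.

```perl
use strict;
use warnings;

# Hypothetical helper (not the real Bio::SeqIO::embl code): split one DR
# line into database name, primary id and optional secondary id.
sub parse_dr_line {
    my ($line) = @_;
    # "DR", whitespace, database, ';', primary id, then optionally
    # "; secondary id", and finally the terminating '.'.
    if ($line =~ /^DR\s+([^\s;]+);\s*([^\s;]+?)(?:;\s*([^\s;]+?))?\.\s*$/) {
        return { database => $1, primary_id => $2, optional_id => $3 };
    }
    return;    # not a DR line this sketch recognises
}

# Hypothetical helper: write a DR line, emitting the second ';' only
# when a secondary id is actually present.
sub format_dr_line {
    my ($db, $primary, $optional) = @_;
    my $line = "DR   $db; $primary";
    $line .= "; $optional" if defined $optional && length $optional;
    return "$line.";
}

# Round-trip the two kinds of DR line quoted in the message above.
for my $line ('DR   GOA; O25226.', 'DR   SPTREMBL; O25217; O25217.') {
    my $xref = parse_dr_line($line) or next;
    print format_dr_line(@{$xref}{qw(database primary_id optional_id)}), "\n";
}
```

Making the second ';' optional on input and conditional on output keeps both
forms of DR line intact through a read/write round trip.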

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

