[Bioperl-l] EMBL release 87 format changes.

Chris Fields cjfields at uiuc.edu
Wed Jul 19 17:46:43 EDT 2006


You can go ahead and submit the patch to Bugzilla anyway.  Comments about
the proposed changes from the developers can be added there.

I think there's some confusion here, though: the SeqIO change you mention
that I committed was actually to Bio::SeqIO::swiss (SwissProt).  I haven't
touched Bio::SeqIO::embl (yet).  The 'swiss' format now reads both old and
new SwissProt data files and writes only the new format.  SeqIO::embl has
had no major changes; its last change was about a year ago, and that was a
small one.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of dwaner at scitegic.com
> Sent: Wednesday, July 19, 2006 2:48 PM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] EMBL release 87 format changes.
> 
> BioPerl Users and Developers,
> 
> I have updated the EMBL SeqIO parser to work correctly with Release 87 of
> EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier
> message, the EMBL parser now reads both new and old formats, but only
> writes the new format.
> 
> I don't think that my changes will affect most users, but if you are using
> the EMBL format, can you review the changes described below and speak up if
> anything looks like it could create a problem for you?
> 
> If I don't hear any objections soon, I will submit a patch to Bugzilla.
> 
> Thanks,
> 
> - David
> 
> Parser changes:
> 
> - EMBL files no longer contain the "entry name".  When reading old format
>   files, the EMBL "entry name" from the ID line is used as the Bio::Seq::id
>   and Bio::Seq::display_id, but when reading new format files, the
>   accession number is used for these fields.
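
To make the old/new ID line difference concrete, here is a minimal sketch in
Python (not the BioPerl implementation, and not code from the patch). The
function name, the regular expressions, and the two sample ID lines are
illustrative assumptions based on the published ID line layouts before and
after Release 87; real files may need more lenient matching.

```python
import re

# Assumed new (Release 87+) layout:
#   ID   accession; SV n; topology; molecule type; data class; division; N BP.
NEW_ID = re.compile(
    r"^ID\s+(?P<accession>[^;\s]+);\s*SV\s+(?P<version>\d+);"
    r"\s*(?P<topology>[^;]+);\s*(?P<molecule>[^;]+);"
    r"\s*(?P<data_class>[^;]+);\s*(?P<division>[^;]+);"
    r"\s*(?P<length>\d+)\s+BP\.\s*$"
)

# Assumed old layout:
#   ID   entry_name  data class; molecule type; division; N BP.
OLD_ID = re.compile(
    r"^ID\s+(?P<entry_name>\S+)\s+(?P<data_class>[^;]+);"
    r"\s*(?P<molecule>[^;]+);\s*(?P<division>[^;]+);"
    r"\s*(?P<length>\d+)\s+BP\.\s*$"
)


def display_id_from_id_line(line):
    """Return the value a reader would use as the sequence id.

    New-format lines yield the accession number; old-format lines yield
    the entry name, mirroring the behaviour described in the message.
    """
    match = NEW_ID.match(line)
    if match:
        return match.group("accession")
    match = OLD_ID.match(line)
    if match:
        return match.group("entry_name")
    raise ValueError("unrecognized EMBL ID line: %r" % line)


# Illustrative sample lines (hypothetical input, not taken from this thread).
new_line = "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."
old_line = "ID   TRBG361    standard; mRNA; PLN; 1859 BP."
print(display_id_from_id_line(new_line))
print(display_id_from_id_line(old_line))
```

The new-format pattern is tried first because it is the stricter of the two.
An old-format line has no semicolon directly after the first token, so it
cannot match the new pattern by accident.
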
> 
> Changes to output:
> 
> - The ID line was changed to the new format.
> 
> - The SV line is never written; SV is now part of the ID line.
> 
> - "DNA" and "RNA" are no longer valid EMBL molecule types. They are now
>   written as "unassigned DNA" and "unassigned RNA".
> 
> - Strictly speaking, EMBL format should only be used for nucleotide
>   sequences.  If the alphabet is 'protein', write_seq() emits a warning and
>   writes the non-standard molecule type "AA" in the ID line.
> 
> - Because BioPerl sequences do not have a "data class" attribute, all
>   sequences are written with a data class of "STD" in the ID line.
> 
> - The ID line contains the Bio::Seq::accession, unless it is missing, in
>   which case the Bio::Seq::id is used.
> 
> - Molecule type is strictly validated.  Non-EMBL values are output as
>   "unassigned DNA" or "unassigned RNA", depending on the sequence alphabet.
> 
> - "Taxonomic division" is strictly validated.  Non-EMBL values are output
>   as "UNC".
> 
> - The taxonomic division code "UNK" is now written as "UNC"
>   (unclassified).
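
The output rules listed above can be sketched roughly as follows. This is a
Python illustration under stated assumptions, not the patch itself: the
function names are hypothetical, and the two sets of "valid" values are
deliberately small, assumed subsets of the EMBL vocabularies. Check the EMBL
user manual for the complete lists.

```python
import warnings

# Assumed, incomplete subsets of the EMBL controlled vocabularies.
VALID_MOLECULE_TYPES = {
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA",
    "other DNA", "other RNA", "unassigned DNA", "unassigned RNA",
}
VALID_DIVISIONS = {"HUM", "INV", "MAM", "PLN", "PRO", "VRL", "VRT", "UNC"}

# Per the message, every sequence is written with this data class.
DATA_CLASS = "STD"


def normalize_molecule_type(molecule, alphabet):
    """Map a molecule type to a value the writer would emit.

    `alphabet` is assumed to be 'dna', 'rna', or 'protein'.
    """
    if alphabet == "protein":
        warnings.warn("EMBL format is meant for nucleotide sequences; "
                      "writing non-standard molecule type 'AA'")
        return "AA"
    if molecule in VALID_MOLECULE_TYPES:
        return molecule
    # Anything else, including bare "DNA" or "RNA", becomes "unassigned ...".
    return "unassigned RNA" if alphabet == "rna" else "unassigned DNA"


def normalize_division(division):
    """Map unknown division codes (including the old 'UNK') to 'UNC'."""
    return division if division in VALID_DIVISIONS else "UNC"


def id_line_name(accession, seq_id):
    """Prefer the accession for the ID line; fall back to the sequence id."""
    return accession if accession else seq_id


print(normalize_molecule_type("DNA", "dna"))
print(normalize_division("UNK"))
print(id_line_name(None, "my_seq"))
```

Note that "UNK" needs no special case in this sketch: it is simply not in the
valid set, so it falls through to "UNC" like any other unrecognized code.
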
> 
> Possible Gotchas for some users:
> 
> - Because the EMBL entry name is no longer included anywhere in the file,
>   when round-tripping from old format to new format the entry name will be
>   lost.
> 
> - In order to ensure that BioPerl writes valid EMBL files, I have added
>   strict validation to the writer for "molecule type" and "taxonomic
>   division".  This could present a problem for users who are using
>   non-standard values for these fields, but I felt it was important to
>   write files that adhere to the EMBL spec.