[Bioperl-l] EMBL release 87 format changes.
dwaner at scitegic.com
Wed Jul 19 15:47:58 EDT 2006
BioPerl Users and Developers,
I have updated the EMBL SeqIO parser to work correctly with Release 87 of
EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier
message, the EMBL parser now reads both new and old formats, but only
writes the new format.
I don't think that my changes will affect most users, but if you use the
EMBL format, could you review the changes described below and speak up if
anything looks like it could cause a problem for you?
If I don't hear any objections soon, I will submit a patch to Bugzilla.
Thanks,
- David
Parser changes:
- EMBL files no longer contain the "entry name". When reading old format
  files, the EMBL "entry name" from the ID line is used as the Bio::Seq::id
  and Bio::Seq::display_id, but when reading new format files, the accession
  number is used for these fields.
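
To make the reading rule above concrete, here is a minimal sketch in Python (not the BioPerl code itself). It assumes the old ID line starts with "entryname dataclass;" and the new one has seven semicolon-separated fields beginning "accession; SV n;". The function name display_id_from_id_line and the two sample ID lines are illustrative assumptions, not taken from the patch.

```python
import re


def display_id_from_id_line(line):
    """Return (format, identifier) for an EMBL ID line.

    Hypothetical helper: 'new' lines yield the accession,
    'old' lines yield the entry name.
    """
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    body = line[2:].strip()
    if not body:
        raise ValueError("empty ID line")
    fields = [f.strip() for f in body.split(";")]
    # Assumed new layout:
    # accession; SV n; topology; molecule type; data class; division; length
    if len(fields) == 7 and re.fullmatch(r"SV \d+", fields[1]):
        return "new", fields[0]
    # Assumed old layout: the first field is "entryname dataclass"
    return "old", fields[0].split()[0]


if __name__ == "__main__":
    print(display_id_from_id_line(
        "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."))
    print(display_id_from_id_line(
        "ID   TRBG361    standard; RNA; PLN; 1859 BP."))
```

Counting fields is only one way to tell the two layouts apart; a real parser may well use a stricter pattern.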
Changes to output:
- The ID line was changed to the new format.
- The SV line is never written; SV is now part of the ID line.
- "DNA" and "RNA" are no longer valid EMBL molecule types. They are now
  written as "unassigned DNA" and "unassigned RNA".
- Strictly speaking, EMBL format should only be used for nucleotide
  sequences. If the alphabet is 'protein', write_seq() emits a warning and
  writes the non-standard molecule type "AA" in the ID line.
- Because BioPerl sequences do not have a "data class" attribute, all
  sequences are written with a data class of "STD" in the ID line.
- The ID line contains the Bio::Seq::accession, unless it is missing, in
  which case the Bio::Seq::id is used.
- Molecule type is strictly validated. Non-EMBL values are output as
  "unassigned DNA" or "unassigned RNA", depending on the sequence alphabet.
- "Taxonomic division" is strictly validated. Non-EMBL values are output
  as "UNC".
- The taxonomic division code "UNK" is now written as "UNC"
  (unclassified).
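
The output rules above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the BioPerl writer: the function names are hypothetical, the two value sets are illustrative subsets rather than the authoritative EMBL lists, and the field order of the new ID line (accession; SV; topology; molecule type; data class; division; length) is assumed from the EMBL documentation.

```python
import warnings

# Assumed, illustrative subsets; the EMBL user manual has the full lists.
VALID_MOLECULE_TYPES = {
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA",
    "other DNA", "other RNA", "unassigned DNA", "unassigned RNA",
}
VALID_DIVISIONS = {
    "ENV", "FUN", "HUM", "INV", "MAM", "MUS", "PHG",
    "PLN", "PRO", "ROD", "SYN", "UNC", "VRL", "VRT",
}


def normalize_molecule_type(molecule, alphabet):
    """Map a molecule type onto a value the writer would emit."""
    if alphabet == "protein":
        # Non-standard: EMBL format is meant for nucleotide sequences.
        warnings.warn("EMBL format is intended for nucleotide sequences")
        return "AA"
    if molecule in VALID_MOLECULE_TYPES:
        return molecule
    # Anything else, including plain "DNA"/"RNA", becomes "unassigned ...".
    return "unassigned RNA" if alphabet == "rna" else "unassigned DNA"


def normalize_division(division):
    """Non-EMBL divisions, including the old "UNK", are written as "UNC"."""
    return division if division in VALID_DIVISIONS else "UNC"


def build_id_line(accession, seq_id, version, topology,
                  molecule, alphabet, division, length):
    """Compose a new-style ID line; the data class is always "STD"."""
    identifier = accession or seq_id  # fall back to the id if no accession
    mol = normalize_molecule_type(molecule, alphabet)
    div = normalize_division(division)
    return (f"ID   {identifier}; SV {version}; {topology}; "
            f"{mol}; STD; {div}; {length} BP.")


if __name__ == "__main__":
    print(build_id_line("X56734", "TRBG361", 1, "linear",
                        "mRNA", "rna", "PLN", 1859))
    print(build_id_line(None, "myseq", 1, "linear",
                        "DNA", "dna", "UNK", 10))
```

How a missing sequence version should be written is not covered by the list above, so the sketch simply takes the version as given.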
Possible Gotchas for some users:
- Because the EMBL entry name is no longer included anywhere in the file,
  when round-tripping from old format to new format the entry name will be
  lost.
- In order to ensure that BioPerl writes valid EMBL files, I have added
  strict validation to the writer for "molecule type" and "taxonomic
  division". This could present a problem for users who are using
  non-standard values for these fields, but I felt it was important to
  write files that adhere to the EMBL spec.