[Biopython-dev] [Bug 2548] Updating IUPACData and ExtendedIUPACProtein for U and O

Mon Jul 21 11:10:02 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2548

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2008-07-21 07:10 EST -------
I've gone over the GenBank release notes on this issue...

Quoting ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb131.release.notes (dated
August 15 2002; similar text appears in earlier release notes too, as a warning
of intended changes):
==============================================================
1.3.3 Selenocysteine representation

  At the May 1999 DDBJ/EMBL/GenBank collaborative meeting, it was learned
that IUPAC plans to adopt the letter 'U' for selenocysteine.

  With this August 2002 release, selenocysteine residues are now presented
via residue abbreviation 'U', in both /translation and /transl_except
qualifiers.
==============================================================

By now GenBank SHOULD have updated any sequences which were using X for
selenocysteine so that they use U instead.

Quoting ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb156.release.notes (dated
October 15 2006; similar text appears in earlier release notes too, as a warning
of intended changes):
==============================================================
1.3.4 New protein residue abbreviation for Pyrrolysine

  Sequence databases use single-letter amino acid abbreviations to
record the primary structure (sequence) of amino acids in a polypeptide.
The table of abbreviations includes only those amino acids that are
encoded in the genetic code and directly inserted by a tRNA during the
process of protein translation.  Post-translational modifications are
not represented in the sequence data itself, but may be described by
features annotated on the sequence.

  The discovery of the 22nd naturally encoded amino acid, pyrrolysine,
and the recent submission of sequence records that should contain
this residue, require the adoption of a new amino acid abbreviation.
Because several letters are assigned to represent different experimental
ambiguities, the only letter still available for use is O (uppercase
letter o).  Scientists working in the field have independently suggested
use of this letter, and it has a reasonable mnemonic, pyrrOlysine.

  The IUPAC-IUBMB Joint Commission on Biochemical Nomenclature has agreed
that Pyl/O will be recommended for this amino acid.

  The consequences for flatfile users are that O can now appear in CDS
/translation qualifiers, and that Pyl (the three-letter abbreviation)
can appear in CDS /transl_except qualifiers and in the /product and
/anticodon qualifiers of tRNA features. These changes are legal as of this
October 2006 GenBank Release.

  Sample ASN.1, FASTA, GenBank flatfile, and INSDSeq XML files for CP000099,
which has a protein with a pyrrolysine residue, are available for testing
purposes at the NCBI FTP site:

        ftp://ftp.ncbi.nih.gov/genbank/Pyrrolysine_Samples

        Files:

        CP000099.pse    (print-form ASN.1 Seq-entry)
        CP000099.gbff   (GenBank flatfile)
        CP000099.aa_fsa (protein FASTA)
        CP000099.isx    (INSDSeq XML)

==============================================================
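To illustrate what the /transl_except change means for anyone parsing flatfiles, here is a minimal sketch in plain Python. It is not Biopython code: the function name, the regular expression, and the sample qualifier value are my own assumptions, and it only handles the simple "(pos:START..END,aa:XXX)" form (no complement() locations). The Sec/U pairing comes from the selenocysteine note above and Pyl/O from this section.

```python
import re

# Three-letter to one-letter codes for the two newer residues discussed above.
NEW_RESIDUE_CODES = {"Sec": "U", "Pyl": "O"}

# Hypothetical pattern for the simple form "(pos:START..END,aa:XXX)".
_TRANSL_EXCEPT = re.compile(r"\(pos:(\d+)\.\.(\d+),\s*aa:([A-Za-z]+)\)")


def parse_transl_except(value):
    """Return (start, end, three_letter, one_letter_or_None), or None if unparsed."""
    match = _TRANSL_EXCEPT.fullmatch(value.strip())
    if match is None:
        return None
    start, end, three_letter = match.groups()
    # Residues outside the small table above are reported with None.
    return int(start), int(end), three_letter, NEW_RESIDUE_CODES.get(three_letter)


# Made-up qualifier value, for illustration only.
print(parse_transl_except("(pos:213..215,aa:Pyl)"))
```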

And later on in the same file,
==============================================================
1.3.5 Protein residue J for leucine/isoleucine ambiguities

  The residue abbreviation J is reserved for mass spectrometry experiments that
cannot distinguish leucine from isoleucine. Although this abbreviation has
been part of the IUPAC recommendations for some time, it has not previously
appeared in protein sequences in the GenBank database.

  As of October 2006, abbreviation J is legal in CDS /translation qualifiers,
and Xle (the three-letter abbreviation) will be allowed in CDS /transl_except
qualifiers and in the /product and /anticodon qualifiers of tRNA features.

  J will also be mapped to unknown (X) for the purpose of BLAST and other
sequence similarity search tools.
==============================================================
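The "J will also be mapped to unknown (X)" rule is easy to mimic before handing a sequence to a tool that does not know about J. A minimal sketch in plain Python follows; the function name is hypothetical and this is not how NCBI or Biopython implement it, just the substitution as described in the text.

```python
# Translation table replacing J (Leu/Ile ambiguity) with X (unknown),
# keeping the case of the original letter.
_J_TO_X = str.maketrans({"J": "X", "j": "x"})


def mask_j_for_search(protein):
    """Return the protein sequence with every J replaced by X."""
    return protein.translate(_J_TO_X)


print(mask_j_for_search("MKJLU"))
```

All other letters, including U and O, are left untouched by this sketch.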

So, according to GenBank, "The residue abbreviation J is reserved for mass
spectrometry experiments that cannot distinguish leucine from isoleucine ...
this abbreviation has been part of the IUPAC recommendations for some time".

I would prefer a direct citation, but that seems good enough evidence to me to
include J in the Biopython IUPAC extended protein alphabet.
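As a rough illustration of what accepting U, O and J means in practice, here is a sketch in plain Python. The constant and function names are assumptions made for this example and are not taken from Biopython's IUPACData; B, Z and X are included as the long-established IUPAC ambiguity codes.

```python
# The 20 standard amino acid letters.
STANDARD_PROTEIN_LETTERS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical extended set: ambiguity codes B, Z, X plus the letters
# discussed in this bug: U (selenocysteine), O (pyrrolysine), J (Leu/Ile).
EXTENDED_PROTEIN_LETTERS = STANDARD_PROTEIN_LETTERS + "BXZ" + "UOJ"


def unexpected_letters(sequence, alphabet=EXTENDED_PROTEIN_LETTERS):
    """Return the sorted letters of sequence that are not in alphabet."""
    return sorted(set(sequence.upper()) - set(alphabet))


print(unexpected_letters("MKUOJ"))                            # extended set
print(unexpected_letters("MKUOJ", STANDARD_PROTEIN_LETTERS))  # standard set
```

With the extended set the sample sequence passes cleanly, whereas the standard 20 letters would flag U, O and J.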
