[Biopython] how to validate fasta format

Peter biopython at maubp.freeserve.co.uk
Tue Oct 27 13:20:58 UTC 2009


On Tue, Oct 27, 2009 at 1:14 PM, Steve Darnell <darnells at dnastar.com> wrote:
>
> Greetings,
>
> This particular thread addresses a topic we've revisited lately,
> ambiguity codes (particularly in the amino acid alphabet).  I would like
> to query the group for their opinion of the remaining 6 characters after
> you remove the 20 standard amino acids.  Here's our list:
>
> B - Asn or Asp
> J - Ile or Leu
> O - ???
> U - seleno-Cys
> X - Any
> Z - Gln or Glu

Your list is incomplete. According to the Biopython
ExtendedIUPACProtein alphabet docstring, which is based on the IUPAC
standards or recommendations:

    B = "Asx";  Aspartic acid (D) or Asparagine (N)
    X = "Xxx";  Unknown or 'other' amino acid
    Z = "Glx";  Glutamic acid (E) or Glutamine (Q)
    J = "Xle";  Leucine (L) or Isoleucine (I), used in mass-spec (NMR)
    U = "Sec";  Selenocysteine
    O = "Pyl";  Pyrrolysine

In practice, X is often also used to mean any amino acid, or even a
stop codon (although in my personal opinion a stop codon would really
benefit from a more explicit character).
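
For anyone who wants to check sequences against the letters listed
above, here is a minimal sketch in plain Python. It does not use
Biopython at all: the helper name invalid_letters, its two flags, and
the hand-typed letter sets are assumptions made for illustration, and
treating "*" as a stop codon is a common convention rather than part
of the alphabet described here.

```python
# Hypothetical helper, not part of Biopython. The letter sets are typed
# out by hand from the list above.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

# The six remaining letters and their three-letter codes.
EXTENDED_CODES = {
    "B": "Asx",  # Aspartic acid or Asparagine
    "X": "Xxx",  # Unknown or 'other' amino acid
    "Z": "Glx",  # Glutamic acid or Glutamine
    "J": "Xle",  # Leucine or Isoleucine
    "U": "Sec",  # Selenocysteine
    "O": "Pyl",  # Pyrrolysine
}


def invalid_letters(sequence, allow_extended=True, allow_stop=False):
    """Return a sorted list of the letters in sequence that are not allowed."""
    allowed = set(STANDARD_AMINO_ACIDS)
    if allow_extended:
        allowed |= set(EXTENDED_CODES)
    if allow_stop:
        # Assumption: "*" marks a stop codon in translated sequences.
        allowed.add("*")
    return sorted(set(sequence.upper()) - allowed)


print(invalid_letters("MKTAYIAKQRBZJUOX"))            # []
print(invalid_letters("MKTB", allow_extended=False))  # ['B']
print(invalid_letters("MKT*"))                        # ['*']
```

An empty list means every letter was accepted. Returning the offending
letters rather than a plain True/False makes it easier to report what
was wrong with a given FASTA record.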

Peter



