[Biopython-dev] RNA alphabets; was Bio.PDB enhancements

Wed Jun 2 13:22:36 UTC 2010

On Wed, Jun 2, 2010 at 1:21 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
> I'm afraid the matter is more complicated. To date, we have 115 modified
> RNA bases, which means in practice that you run out of nice ASCII
> characters. Moreover, some people use one-letter symbols in RNA as
> wildcards (R for purine, Y for pyrimidine). As a consequence, several sets
> of abbreviations have been developed - see
> http://modomics.genesilico.pl/modification_list to get an impression.
>
> For our own purposes we've written a class that holds the different
> nomenclatures, but I think it's incompatible with Bio.Alphabet. I'd like
> to change that.
>
> Best Regards,
>   Kristian

Hmm. I wonder if the HTML entities would work nicely in Python
(as Unicode)? That way you could have an unambiguous string
representation where each letter is one character long.
I'm thinking a Seq subclass (with a special alphabet) might be
the way to go here, giving access to the single-character
entities by default and to the longer codes as well.
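
A rough sketch of the idea in plain Python, just to make it concrete.
Everything here is an assumption for illustration: the class name
ModifiedRNA, the LONG_NAMES table and the choice of symbols are made up,
and it subclasses str rather than Seq so that it runs without Biopython.

```python
import html

# Hypothetical table: one Unicode character per residue, mapped to a
# longer name. The symbol choices are illustrative only, not an
# established nomenclature.
LONG_NAMES = {
    "A": "adenosine",
    "C": "cytidine",
    "G": "guanosine",
    "U": "uridine",
    "I": "inosine",
    html.unescape("&Psi;"): "pseudouridine",  # HTML entity -> one character
}


class ModifiedRNA(str):
    """Hypothetical string type: one character per (possibly modified) base."""

    def __new__(cls, data):
        # Reject any letter that the alphabet table does not define.
        unknown = sorted(set(data) - set(LONG_NAMES))
        if unknown:
            raise ValueError("letters not in alphabet: %r" % unknown)
        return super().__new__(cls, data)

    def long_names(self):
        # Longer codes on request; the default view stays one char per base.
        return [LONG_NAMES[ch] for ch in self]


seq = ModifiedRNA("GC" + html.unescape("&Psi;") + "IA")
print(len(seq))
print(seq.long_names())
```

Slicing and len() keep working per residue because each base is a single
code point; whether a real Seq subclass would behave as nicely is
something I haven't checked.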

There are similarities with modified peptide sequences, where
there are clear three-letter codes but no one-letter codes.

Tricky.

Peter