[BioPython] modified nucleotides

Mon, 17 Apr 2000 04:57:26 -0600

 Christoph Wierling:
>   I think, for some reasons it may be necessary to know if a DNA- or
>RNA-nucleotide of a sequence is modified e.g. by methylation.
>Obviously the IUPAC does not give any suggestions, but I think it should
>be taken into consideration how to describe modifications like
>N6-methyladenosine, 5-methylcytosine, or N4-methylcytosine in
>DNA-sequences (e.g. in any derivation of the Alphabet class).

I recall a bit in the IUPAC documentation, but I'm about to go to bed
and don't want to look it up now.  I believe it says that some
modifications are position dependent (eg, the 5th base from the
3' end) and some are statistical.

I did design the "generic" functions with these in mind.

The major support, as you mention, is the Alphabet class.  If a given
position is always methylated, create your own alphabet, derived
from Bio.Alphabet.RNAAlphabet, using "AUCG" plus a special character
of your own, and translate your input sequence into this encoding.
(In general, that character could be used anywhere, but since you
control the I/O, you know that it only occurs in one place.)
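
A minimal sketch of that idea, written as standalone Python so it runs
without importing Bio.Alphabet.  The class names, the "Z" marker and the
encode() helper are all hypothetical, made up for illustration, and are
not part of any Biopython API.

```python
# Hypothetical stand-ins for an alphabet hierarchy; these are NOT the
# Bio.Alphabet classes, just plain classes carrying a `letters` string.
class RNAAlphabet:
    letters = "AUCG"


class MethylatedRNAAlphabet(RNAAlphabet):
    # "Z" is an arbitrary marker for the modified base (an assumption).
    letters = RNAAlphabet.letters + "Z"


def encode(seq, position, alphabet=MethylatedRNAAlphabet, mark="Z"):
    """Return seq with the base at 0-based `position` replaced by `mark`."""
    if mark not in alphabet.letters:
        raise ValueError("marker %r is not in the alphabet" % mark)
    unknown = set(seq) - set(alphabet.letters)
    if unknown:
        raise ValueError("letters not in alphabet: %s" % sorted(unknown))
    if not 0 <= position < len(seq):
        raise IndexError("position out of range")
    return seq[:position] + mark + seq[position + 1:]


# The I/O layer knows which position is modified, so it does the marking.
encoded = encode("GGAACCA", 3)
```

Because the caller decides where the marker goes, the alphabet itself
does not need to know anything about positions.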

This won't work as well for statistical properties.  Probably the
easiest approach is to generate one instance of the allowed distribution
(eg, randomly (but biologically appropriately) methylate bases).
This won't give the correct results for some cases.
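
Generating one such instance could look roughly like this.  The uniform
per-site probability is only a placeholder for whatever rule is
biologically appropriate, and the function name, the "Z" marker and the
default values are assumptions for the sake of the example.

```python
import random


def sample_methylation(seq, target="C", mark="Z", prob=0.3, rng=None):
    """Return one random instance of seq with some `target` bases marked.

    Each occurrence of `target` is replaced by `mark` independently with
    probability `prob`.  A real model would likely depend on context.
    """
    if rng is None:
        rng = random.Random()
    return "".join(
        mark if base == target and rng.random() < prob else base
        for base in seq
    )


# A seeded generator makes the sampled instance reproducible.
instance = sample_methylation("ACGCCAUC", prob=0.5, rng=random.Random(0))
```

Any analysis run on a single instance only reflects that one draw, which
is one reason the results can be wrong for some cases.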

Once you have the character sequence and alphabet, you will need
to have data tables for the new Alphabet, and likely things like a
translation object.

Most of the generic functions will work so long as the new alphabet
is registered with the property manager along with the relevant
data tables and conversion objects.
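
To make the registration idea concrete, here is a toy registry keyed by
alphabet class.  It is a hypothetical sketch, not the actual property
manager interface: the class, its register() and resolve() methods, the
"Z" marker and the choice to pair "Z" with "U" are all assumptions.

```python
class RNAAlphabet:
    letters = "AUCG"


class MethylatedRNAAlphabet(RNAAlphabet):
    letters = RNAAlphabet.letters + "Z"  # "Z" marks a modified base


class PropertyRegistry:
    """Toy lookup of named data tables by alphabet class."""

    def __init__(self):
        self._tables = {}

    def register(self, alphabet_cls, name, value):
        self._tables[(alphabet_cls, name)] = value

    def resolve(self, alphabet_cls, name):
        # Walk the class hierarchy so a derived alphabet falls back to
        # its parent's table when it has not registered its own.
        for cls in alphabet_cls.__mro__:
            if (cls, name) in self._tables:
                return self._tables[(cls, name)]
        raise KeyError("no %r table for %s" % (name, alphabet_cls.__name__))


registry = PropertyRegistry()
rna_complement = {"A": "U", "U": "A", "C": "G", "G": "C"}
registry.register(RNAAlphabet, "complement", rna_complement)
# Illustrative only: treat the modified base as pairing with "U".
registry.register(MethylatedRNAAlphabet, "complement",
                  dict(rna_complement, Z="U"))


def complement(seq, alphabet_cls):
    """A 'generic' function: it only needs the registered table."""
    table = registry.resolve(alphabet_cls, "complement")
    return "".join(table[base] for base in seq)


result = complement("GGAZCCA", MethylatedRNAAlphabet)
```

The generic function never mentions the modified base; everything
specific to the new alphabet lives in the registered table.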

But I have never used anything with modified DNA/RNA -- that is,
I've not done sequence analysis with them, though I've done some
structure visualization work with some.  So I don't know the pitfalls.

What are some use cases that would give me a better feel?  Can I
use "a sequence of a finite set of characters plus an alphabet
encoding"?

                    Andrew
                    dalke@acm.org