[BioPython] Sequence numbering. Moving on...

Andrew Dalke dalke@bioreason.com
Mon, 04 Oct 1999 00:45:10 -0600


Iddo Friedberg <idoerg@cc.huji.ac.il> said:
> 1) A sequence will be a base class, with subclasses for nucleotide and
> peptide sequences. The subclasses will be used.

I've been playing around with a couple of ideas on how to deal with
different sequence types.  My problem was in dealing with Ewan's
comments on different sequence alphabets.   I identified the follow

  any (unconstrained alphabet, unspecified sequence type)

  generic dna (unconstrained alphabet, but somehow related to DNA)
  iupac ambiguous dna (GATCRYWSMKHBVDNX)
  iupac unambiguous dna (GATC)

  generic protein (unconstrained alphabet, some kind of protein)
  iupac ambiguous protein (ABCDEFGHIKLMNPQRSTVWXYZ)
  iupac unambiguous protein (ACDEFGHIKLMNPQRSTVWY)

As someone (here? bioperl?) pointed out, some of the sequencing
machines use case differences to indicate meaning, like certainty
of results.  I tried playing around with making a set of class
definitions to incorporate all of these differences, but it because
a complicated mess:

class SeqType:
    name = "unknown"
    alphabet = string.uppercase + string.lowercase  # or all ASCII? Unicode?
class ProteinSeqType(SeqType)
    name = "protein"
class DNASeqType(SeqType):
    name = "dna"
class RNASeqType(SeqType):
    name = "rna"

# These have restricted alphabets
class IUPAC_ProteinSeqType(ProteinSeqType):
    name = "iupac_protein"
    alphabet = "ABCDEFGHIKLMNPQRSTVWXYZ"
class Unambiguous_IUPAC_ProteinSeqType(IUPAC_ProteinSeqType):
    name = "unambiguous_iupac_protein"
    alphabet = "ACDEFGHIKLMNPQRSTVWY"  # The infamous 20

class IUPAC_DNASeqType(DNASeqType):
    name = "iupac_dna"
    alphabet = "GATCRYWSMKHBVDNX"
class Unambiguous_IUPAC_DNASeqType(IUPAC_DNASeqType):
    name = "unambiguous_iupac_dna"
    alphabet = "GATC"

class IUPAC_RNASeqType(RNASeqType):
    name = "iupac_rna"
    alphabet = "GATCRYWSMKHBVDNX"
class Unambiguous_IUPAC_RNASeqType(IUPAC_RNASeqType):
    name = "unambiguous_iupac_rna"
    alphabet = "GAUC"


If this was it, then I could live with it, but there are still problems.
For starters, I don't know all the different types of sequences people
deal with, like the UPPER/lowercase distinction.

I like the fact that if I have a sequence I can tell if it is a given
sequence type by doing something like:

  if isinstance(seq, IUPAC_RNASeqType):
     blah

However, down that path lies bad design, since checking for class
type is usually a very bad thing.  For one thing, it strongly suggests
implementations with multiple inheritance (one parent to get the right
data type and the other to inherit funcionality, like talking to
the database or via CORBA).

For another, it means that sequence cannot change their data type
(well, in 1.5.2 this is possible since you can dynamically change
your __bases__ but that's not something I think should be done in
well designed code).

This could be important if you want to have editable sequence data
structures.  Consider the following; you have a GUI where the user
is inputting sequences (already specified that it's DNA, RNA or
protein).  You have a toolbar listing the possible analysis that
can be done to the given sequence.  One of these can only be applied
to unambiguous DNA sequences, so must be greyed out when, say, "N"
is used, and ungreyed when the offending characters are removed.

It would be good to have the sequence stored in a single object
which didn't need conversion before being passed to the appropriate
routines.  Thus, if we were in a staticly typed language like C++ or
Java we would have problems because the mutable sequence type
doesn't cast directly to the "IUPAC unambiguous DNA" sequence type.
(It would be done with a wrapper of some sort.)

Since we are using Python, we can get away with mixing up types,
but at the peril of having run-time errors, and of making an
implementation which is hard to port to other languages.  Also, we
might not get some performance advantages.  For example, if there
are lower case letters and a given algorithm requires uppercase,
then I want to be able to translate things appropriately, but I
don't want to call string.upper if I don't need to do so.

I've been considering a couple of possibilities.  The first is based
on some discussion I recall from the last Python conference concerning
"protocols".  I've mentioned that in the bioperl list before.  In
Python, in most cases you don't need a given class, you need
something which implements the features of that class. For example,
there are many objects which implement the "input file protocol", that
is, methods like "readline()".  To make Python more statically type
safe, say, for a hyptothetical compiler or for design-by-contract
fans, you would like to say "this input parameter to this routine must
implement the input file protocol".

One idea there was to have "__protocols__" be a list of supported
protocols.  I'm thinking about extending Ewan's "moltype" to be
a list of strings, rather than a single string.  Currently it is
"protein", "dna", and "rna", but it could be

  ["dna", "iupac-dna", "unambiguous-iupac-dna"]

then the function could get type information, if it needs it, by
saying

   if "unambiguous-iupac-dna" in seq.moltypes:
      # do optimized work here since there are only 4 characters
   else:
      # calculate the answer, but also add the checks to make
      # sure we raise a TypeError if we see a character which isn't
      # one of the 4 allowed possibilities

This solution is available without multiple inheritance (basically
it emulates the __bases__ term of a class) and useable by other
languages, though perhaps in a slightly different form.

Also, it does not preclude having a type system.  For example, we
could still have:

| class Unambiguous_IUPAC_ProteinSeqType(IUPAC_ProteinSeqType):
|     seqtypes = IUPAC_ProteinSeqType.seqtypes + ["unambiguous-protein"]

(though I don't know if this is proper Python code!)


> 2) A method which returns a sequence, will return a sequence object,
> not a string.

I'm not sure I follow you here.  In your code you have:
>        def seq(self, min, max):
>                # with a base of 1 and including the end
>                # Negative slice notation not allowed
>                assert max >= min and min >0
>                return self[min-1:max]

This returns a sequence, but not a sequence object.  In fact, as 
your example shows, it returns a list of character strings.
> >>> seq(1,3)
> ['A', 'T', 'G']

I would rather deal with strings than arrays, for a few reasons.

1) Strings are immutable, and I prefer dealing with immutable data types
in general (you can make better assertions about them).  It is possible
to edit sequences, but I would rather not deal with that for now.

2) I would rather have a Numeric array of characters, which is much less
of a memory hog, than a list of strings, but then this requires my
hoped-for addition of the Numeric array types to standard Python in
the 1.6 release :)

3) Now you have three data types for sequences: a string (used in
the constructor), the Seq class, and the list of strings.  It is
better to reduce the number of fundamental data types.

I proposed having sequences act like strings so that there is,
in essense, only 1.5 data types-- most algorithms will work on strings
and the 0.5 is for those algorithms which might want to get, say,
information about the alphabet used in a given data type.

Besides, if you stay with strings you can always create a new
subseqence of the same data type with:

  sub_s = seq.__class__(seq[51:92])

If you take my example from Unambiguous_IUPAC_ProteinSeqType you'll
even preserve the correct seqtypes field.

> 4) I'm not very good at streamlining Python code. If anyone likes
> this and has a proposition on how to make this faster, I'd like to
> know.


> class Seq(UserList):
>         def __init__(self, inseq=''):
>                 self.data = []
>                 for i in inseq:
>                         self.append(i)

use
                  self.data = map(None, inseq)

>                 if min == None:

use

>                 if min is None:



>    def complement(self, min=None, max=None):
>            retSeq = DNASeq()
>            for i in self(min,max): # This calls the seq attribute in the
>                                    # base class
>                    retSeq.append(compTable[i])
>            return retSeq

Might want to try something like

> import string

>            terms = map(compTable.get, self(min, max))
>            try:
>               s = string.join(terms, "")
>            except TypeError:
>               raise TypeError, "sequence has untranslatable characters"
>            return DNASeq(s)

The TypeError occurs when .get fails and returns a None, which cannot
be turned into a string.  To be nice, you may want the exception to
check the string for the first character that fails, to return more
detailed information about the failure.

>                        try:
>                                c = c + chargeTable[i]
>                        except KeyError:
>                                pass

Use
>                        c = c + chargeTable.get(i, 0.0)


This, BTW, is in addition to Jeff's valid comments, which is that these
functions shouldn't be part of the class since they describe a more
generic algorithm, and putting them here requires much duplication
elsewhere in the future.

						Andrew