[BioPython] some code

Andrew Dalke dalke@acm.org
Fri, 7 Apr 2000 06:46:50 -0600


I've put the code I've been experimenting with at

ftp://starship.python.net/pub/crew/dalke/biopython/ 

There are two packages there.  The first is Bio-0.1.tar.gz,
which is the Sequence code.  To get things going (on a unix-like
system), get Bio-0.1 and
  ./configure ; make ; make check

There are actually three small regression tests under test/,
so you can see what working code is supposed to look like.
No guarantees that anything works, though "make check" works
on the starship.

The other package is a modified version of automake with Python
support.  It's been over a year and even though I sent in the
copyright disclaimer to the FSF, it still hasn't been added to
the main tree.  What it does is automate the making of Python
distributions like this one (configuration testing, byte-code
compilation, and creating the .tar.gz files).  Plus, I've got
some extra code running regression tests, code coverage, and
build tagging for CVS.

The autoconf stuff won't work for installation under MS Windows,
but everything is runnable in-place.  BTW, I would like to get
it working with distutils as well.

Probably the most unusual code is the PropertyManager.  I was
starting to do a full writeup of it, but it got to be quite
late.  The general idea is that I wanted a way to have generic
functions, like "translate" and "revcomp", which were still
alphabet specific.

Eg, I wanted
  translate(dna)
and
  translate(rna)

to both give me a protein, rather than something like

  Bio.Tools.Translate.unambiguous_dna_by_id[1].translate(dna)
and
  Bio.Tools.Translate.unambiguous_rna_by_id[1].translate(rna)

In C++ this is easy (I think) since translate would be a
template parameterized on the alphabet type, which would be
matched to some default translation table for that type.

In Python, the translate function would usually need to do:
  if type(seq.alphabet) == IUPAC.DNA:
     return Bio.Tools...translate(seq)
  elif type(seq.alphabet) == IUPAC.RNA:
     return ...

but adding more sequence types, like one which tries to encode
data with upper/lowercase values, means having to change every
function to add a new type() check.

What a PropertyManager does is automate the typed-property
lookup.  I can associate a property called "translator" with
the IUPAC.DNA alphabet and point it to the right default translator
for that class, and have the "translator" of RNA be appropriate
for RNA.  If the property isn't available for a given class,
the property resolver will walk the __bases__ of the class to
see if it's present in a parent.
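
To make the lookup concrete, here is a rough sketch of the idea.  It
is not the actual PropertyManager from Bio-0.1: the class names, the
register/resolve method names, and the string standing in for a
translator object are all made up for illustration.

```python
# Hypothetical sketch of a typed-property lookup; not the Bio-0.1 code.

class Alphabet:
    pass

class DNAAlphabet(Alphabet):
    pass

class IUPACUnambiguousDNA(DNAAlphabet):
    pass


class PropertyManager:
    def __init__(self):
        # Maps class -> {property name -> value}.
        self._table = {}

    def register(self, klass, name, value):
        self._table.setdefault(klass, {})[name] = value

    def resolve(self, klass, name):
        # Look on the class itself first.
        props = self._table.get(klass, {})
        if name in props:
            return props[name]
        # Otherwise walk the parents, depth first.
        for base in klass.__bases__:
            try:
                return self.resolve(base, name)
            except KeyError:
                pass
        raise KeyError((klass, name))


manager = PropertyManager()
# The string is a placeholder for a real translator object.
manager.register(DNAAlphabet, "translator", "standard DNA translator")

# Nothing is registered on the subclass, so the parent's value is found.
found = manager.resolve(IUPACUnambiguousDNA, "translator")
```

A generic translate(seq) would then only need to ask the manager for
the "translator" of type(seq.alphabet), so a new alphabet type means a
new registration rather than a new type() check in every function.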

Additionally, if I'm being alphabet-type-safe, then the translation
routines need to import the alphabets to ensure the input is the
right nucleotide alphabet, while the alphabets need to import the
translation routines to tell the property manager which are the
default routines for that type.

This causes a circular import loop.  C++ solves it by having
both declarations and definitions.  I solved it by adding resolver
hooks into the property table.  If the property doesn't exist for
a class, then a class property resolver (one specific to that class
and property) is used, if available.  If that fails, a generic class
resolver (which handles all properties for the class) is used, if
available.  Only then does it traverse through the parents.
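
The lookup order above could be sketched like this.  Again this is
only an illustration under my own assumptions, not the Bio-0.1 code:
the names are hypothetical, "fails" is taken to mean "raises KeyError",
and caching the resolved value is an extra choice I made here.

```python
# Hypothetical sketch of resolver hooks; not the Bio-0.1 code.

class LazyPropertyManager:
    def __init__(self):
        self.properties = {}          # (klass, name) -> value
        self.property_resolvers = {}  # (klass, name) -> hook(manager, klass, name)
        self.class_resolvers = {}     # klass -> hook(manager, klass, name)

    def resolve(self, klass, name):
        key = (klass, name)
        # 1. A value stored directly for this class.
        if key in self.properties:
            return self.properties[key]
        # 2. The class property resolver, then 3. the generic class resolver.
        hooks = (self.property_resolvers.get(key),
                 self.class_resolvers.get(klass))
        for hook in hooks:
            if hook is None:
                continue
            try:
                value = hook(self, klass, name)
            except KeyError:
                continue
            self.properties[key] = value  # assumption: cache the result
            return value
        # 4. Only then traverse the parents.
        for base in klass.__bases__:
            try:
                return self.resolve(base, name)
            except KeyError:
                pass
        raise KeyError(key)


class DNA:
    pass

class RNA:
    pass

calls = []

def dna_translator_resolver(manager, klass, name):
    # In real code the import of the translation module would go here,
    # so it happens on first lookup instead of at alphabet import time.
    calls.append(klass)
    return "DNA translator"  # placeholder for a real translator object


manager = LazyPropertyManager()
manager.property_resolvers[(DNA, "translator")] = dna_translator_resolver

first = manager.resolve(DNA, "translator")
second = manager.resolve(DNA, "translator")  # served from the cache
```

Because the hook is only called on first lookup, the alphabet module
never has to import the translation module at load time, which is
what breaks the import loop.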

This can reduce load time, since the mappings aren't fully made
until they are needed.

I also wrote it this way so some alphabets can build entirely
off others.  For example, a three letter sequence encoding (which
would have a ThreeLetterProteinAlphabet) could have a class resolver
that then looks into the IUPACProteinAlphabet for the right values.
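
As a sketch of that delegation (hypothetical names throughout, with
the parent traversal left out to keep it short, and a placeholder
string instead of real property data):

```python
# Hypothetical sketch of one alphabet delegating to another.

class IUPACProteinAlphabet:
    pass

class ThreeLetterProteinAlphabet:
    pass


properties = {}       # (klass, name) -> value
class_resolvers = {}  # klass -> hook(klass, name)

def resolve(klass, name):
    key = (klass, name)
    if key in properties:
        return properties[key]
    if klass in class_resolvers:
        return class_resolvers[klass](klass, name)
    raise KeyError(key)

def three_letter_resolver(klass, name):
    # Forward every property request to the one-letter alphabet.
    return resolve(IUPACProteinAlphabet, name)


# Placeholder value; a real table would hold actual property data.
properties[(IUPACProteinAlphabet, "weight_table")] = "IUPAC protein weight table"
class_resolvers[ThreeLetterProteinAlphabet] = three_letter_resolver

value = resolve(ThreeLetterProteinAlphabet, "weight_table")
```

Nothing is registered on ThreeLetterProteinAlphabet itself; every
lookup is answered from whatever IUPACProteinAlphabet has.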

                    Andrew