[BioPython] PropertyManager

Sun, 9 Apr 2000 04:42:25 -0600

Here's the "PropertyManager" email I mentioned a couple of days
ago that I would write.

This is "Proposal 4", but I can't specify it well enough as a simple
proposal, so I'll just do the commentary part.

There were two problems I was struggling with: how can I write
`generic' code, and how can I simplify accessing the huge amount of
data associated with each sequence type.

The meaning of `generic' goes back to my first proposal.  There are
many different alphabet encodings.  I want to support the non-IUPAC
and non-character schemes as much as is reasonably possible.

(As an aside: I learned of another non-character encoding last week.
Use integers to represent proteins, so bit 1 is Gly, 2 is Cys, etc.
20 amino acids < 32 bits.  This is used to track ambiguous amino acid
assignments.)
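
A minimal sketch of that integer encoding, reading "bit 1" as the
lowest-order bit.  Only the positions of Gly and Cys come from the
description above; the rest of the ordering, and the names RESIDUES,
FLAG, encode and decode, are assumptions made for illustration.

```python
# Illustrative subset; the ordering beyond Gly and Cys is an assumption.
RESIDUES = ["Gly", "Cys", "Ala", "Ser"]

# Residue name -> single-bit flag (Gly = 1, Cys = 2, Ala = 4, Ser = 8).
FLAG = {name: 1 << index for index, name in enumerate(RESIDUES)}


def encode(names):
    """Pack a set of candidate residues into one integer."""
    mask = 0
    for name in names:
        mask |= FLAG[name]
    return mask


def decode(mask):
    """List the residues whose bits are set in mask."""
    return [name for name in RESIDUES if mask & FLAG[name]]


# A position that could be either Gly or Ala.
gly_or_ala = encode(["Gly", "Ala"])
candidates = decode(gly_or_ala)
```

An ambiguous assignment is then a single integer with several bits
set, and narrowing it down is a bitwise AND.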

The huge amount of data includes names, molecular weight, volume,
surface area, hydrophobicity, codon tables, and alignment matrices.
AAIndex <http://kegg.genome.ad.jp/dbget/aaindex.html> lists 434
per-amino acid tables.

******

It's best to explain things with an example, so here's a simple
routine to find the molecular weight of a sequence.  The residue
weights are passed in "weight_table."

  def compute_mw(seq, weight_table):
      weight = 0.0
      for c in seq.data:
          weight = weight + weight_table[c]
      return weight

(Or the one liner:
def compute_mw(seq, w): return reduce(lambda x,y,w=w: x+w[y], seq.data, 0.0)
 :)

The weight table is tied to the alphabet encoding of the sequence, so
the caller needs to track which alphabet was used to get the correct
weight table.  The "thing which made the sequence" has to tell
the "thing which needs the weight" how to get the right weight table.

There's no way to specify the default table for a given sequence type.
This makes using 'compute_mw' a bit more complicated.  Eg, in
interactive mode you always have to do something like:

   weight = compute_mw(seq, Bio.Data.IUPACData.protein_weight_table)

******

I want to simplify the interface.

Having explicit if/then type() checks in every function for the
sequence is right out.  (It could work in C++ because the appropriate
function is chosen from its type signature.)

One way to solve this is to have these attributes be part of either
the sequence or, more specifically, the alphabet.  Then the caller
could do:

  compute_mw(seq, seq.alphabet.weight_table)

or the callee could do:

  def compute_mw(seq, weight_table = None):
      if weight_table is None:
          weight_table = seq.alphabet.weight_table
      ...

I like the latter form since there is an increase in usability for
nearly everyone (only rarely do you need a different weight table)
with only a slight performance hit (checking if the table is None).

BTW, 'alphabet.weight_table' might be written
'alphabet.property["weight_table"]'.  Otherwise, new data gets
assigned to new attributes, and I don't like adding attributes
dynamically.  It's legal in Python, but I've a strong feeling against
it from my old C++ days.

******

Another way is to store the properties in an external table, keyed
off of the alphabet's class.  This corresponds to having only class
data for an alphabet, which I believe is acceptable.

A simple property manager would be a dictionary, of dictionaries,
as in:

default_manager = {
    Alphabet.IUPAC.IUPACProtein: {
        "weight_table": Data.IUPACData.protein_weight_table,
        "atom_counts": Data. ...
    },
    Alphabet.IUPAC.IUPACAmbiguousDNA: {
        "weight_table": Data.IUPACData.ambiguous_dna_weight_table,
        "atom_counts": Data. ...
    }
}

All this is doing is mapping from a class+property name to an existing
object, so the caller can use the proper weight table, if known:

   weight = compute_mw(seq, Bio.Data.IUPACData.protein_weight_table)

and the callee level is changed to:

  def compute_mw(seq, weight_table = None):
      if weight_table is None:
          weight_table = default_manager.resolve(seq.alphabet.__class__,
                                                 "weight_table")
      ...
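
The callee-side lookup can be tried out end to end with a sketch like
the following.  It is not the actual implementation: the
PropertyManager class here is just the dictionary of dictionaries with
a resolve() method, IUPACProtein and Seq are stand-ins for the real
classes, and the weights are placeholder numbers rather than real
residue weights.

```python
class PropertyManager:
    """Maps (alphabet class, property name) to an existing object."""

    def __init__(self):
        self.class_property = {}  # klass -> {property name: object}

    def resolve(self, klass, name):
        return self.class_property[klass][name]


class IUPACProtein:  # stand-in for Alphabet.IUPAC.IUPACProtein
    pass


class Seq:  # stand-in for the real sequence class
    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet


default_manager = PropertyManager()
# Placeholder numbers, not real residue weights.
default_manager.class_property[IUPACProtein] = {
    "weight_table": {"A": 1.0, "G": 2.0},
}


def compute_mw(seq, weight_table=None):
    # Fall back to the table registered for the sequence's alphabet class.
    if weight_table is None:
        weight_table = default_manager.resolve(seq.alphabet.__class__,
                                               "weight_table")
    return sum(weight_table[c] for c in seq.data)


seq = Seq("AGA", IUPACProtein())
default_weight = compute_mw(seq)                        # table from the manager
custom_weight = compute_mw(seq, {"A": 10.0, "G": 0.5})  # caller overrides it
```

The explicit-table call still works unchanged, which is the point: the
manager only supplies a default.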

******

I had originally thought that putting the data as some sort of
property lookups of the alphabet type (the first solution) would be
too complicated.  While writing up this email, I see the first way
actually is pretty easy to implement using the property manager
(second solution).  It's:

   class Alphabet:
     ... standard definition here ...
     def resolve(self, property):
         return default_manager.resolve(self.__class__, property)
     def add_property(self, property, obj):
         default_manager.class_property[self.__class__][property] = obj

I need to consider this more fully.  In the meanwhile, I'll
describe using the property table.

******

Then there are the hundreds (thousands?) of items which can be made
attributes of the alphabet.  There will have to be some way to
associate new properties to existing alphabets.  I mentioned a couple
in passing, above.  My strong belief is that data should be accessed
in the same fashion regardless of its origin (from biopython or not).

For example, assuming the alphabet was not mutable in any fashion,
then there would be no way to add new properties to it.  Then code
would use the alphabet for some data, and some other means for other
data.  Ugly.

This was the reason I didn't want to have properties stored directly
via alphabets.  It still is the reason why I think alphabets shouldn't
have per-instance data.

******

When should the association be made?

Originally I wanted to have the tables directly tied to the alphabets,
but ran into problems with the codon table.  I've been using
translation as my test-bed for experiments with alphabet type-safety.
A CodonTable, in addition to storing the mapping from nucleotides to
proteins, also stores the 'nucleotide_alphabet' and the
'protein_alphabet.'  In that way, input parameters can be checked with

   assert isinstance(dna.alphabet, self.nucleotide_alphabet.__class__)

and the proper output alphabet assigned with

    return Seq.Seq(string.join(letters, ''), self.protein_alphabet)

To preserve the simple interface, there is a set of codon table
properties for the sequence.  It's a bit more complicated than that,
since translation objects are really stored, but that's another email.
The end result looks like:

  def translate(seq, id = None):
      if id is None:
          prop_name = "translator"
      else:
          prop_name = "translator.id.%d" % id
      translator = default_manager.resolve(seq.alphabet, prop_name)
      return translator.translate(seq)

Because of the import loop, the sequence alphabet must be very careful
about initialization.  The alphabets must be fully loaded before
importing the codon tables, or any of the other alphabet-safe objects.

The only thing which is guaranteed to be imported is the alphabet,
because it's used to create the sequence object in the first place.
Thus, the import of the codon table could be done where the alphabet
is defined.

But then *all* of the data will be loaded, which impacts the startup
time.

I could say "before you use 'translate' using default values you must
first import the appropriate translation module" and put the property
table assignment code in that module. But the point of having an
association was to simplify things like

   weight = compute_mw(seq, Bio.Data.IUPACData.protein_weight_table)

into

   weight = compute_mw(seq)

not
   import Bio.Data.IUPACData
   weight = compute_mw(seq)

(especially since pylint will complain about Bio.Data.IUPACData not
being used for anything.)

******

So I chose to have an on-demand property resolver.  When resolving a
class property, a dictionary is checked to see if an object had been
assigned.  If not, another dictionary is checked to see if a resolver
object is assigned for that property.

If so, the resolver is called to get the property object.  The IUPAC
alphabet definition can end with something like:

  def _bootstrap(resolver, klass, property):
      del resolver.class_property_resolver[klass][property]
      from Bio.Encodings import IUPACEncoding

  default_resolver.class_property_resolver[IUPACProtein]["translate"] = \
     _bootstrap
  default_resolver.class_property_resolver[IUPACUnambiguousDNA]["translate"] = \
     _bootstrap

and all of the associations are made in IUPACEncoding.

To simplify things further, if the object resolver fails, then a
generic class resolver is checked.  Thus, the bootstrap code is really
more like:

  def _bootstrap(resolver, klass, property):
      del resolver.class_resolver[klass]
      from Bio.Encodings import IUPACEncoding

  default_resolver.class_resolver[IUPACProtein] = _bootstrap
  default_resolver.class_resolver[IUPACUnambiguousDNA] = _bootstrap

If that fails, the alphabet's class hierarchy is traversed, and the
three tables are checked for each parent.
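
The lookup order just described (assigned object, then per-property
resolver, then generic class resolver, repeated for each parent class)
can be sketched as below.  This is an illustration under assumptions,
not the real code: the resolver is handed the class it was registered
on and is expected to return the property object, and the bootstrap
registers a placeholder table inline where the real one would import a
module.  ExtendedProtein is a hypothetical subclass.

```python
class PropertyManager:
    def __init__(self):
        self.class_property = {}           # klass -> {name: object}
        self.class_property_resolver = {}  # klass -> {name: callable}
        self.class_resolver = {}           # klass -> callable

    def resolve(self, klass, name):
        # Walk the class and its parents in Python's own lookup order.
        for k in klass.__mro__:
            assigned = self.class_property.get(k, {})
            if name in assigned:
                return assigned[name]
            resolver = self.class_property_resolver.get(k, {}).get(name)
            if resolver is None:
                resolver = self.class_resolver.get(k)
            if resolver is not None:
                return resolver(self, k, name)
        raise KeyError((klass, name))


class Alphabet:
    pass


class IUPACProtein(Alphabet):
    pass


class ExtendedProtein(IUPACProtein):  # hypothetical subclass
    pass


loaded = []  # records each time the on-demand loader runs


def _bootstrap(manager, klass, name):
    # Remove ourselves first so the data is only loaded once.
    del manager.class_resolver[klass]
    loaded.append(klass)
    # Stand-in for importing the module that makes the associations.
    manager.class_property.setdefault(klass, {})["weight_table"] = {"A": 1.0}
    return manager.resolve(klass, name)


manager = PropertyManager()
manager.class_resolver[IUPACProtein] = _bootstrap

# Nothing is loaded until a property is actually requested.
table = manager.resolve(ExtendedProtein, "weight_table")
```

The subclass has nothing registered on it, so the request falls
through to IUPACProtein, whose class resolver loads the data once.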

This design means that *all* of the data might still be loaded, but
only if a function is called which needs the alphabet data.  I prefer
the

   weight = compute_mw(seq, Bio.Data.IUPACData.protein_weight_table)

construct, so all of the data will be accessible via means other than
the property manager.  If the performance is that important, keep
better track of alphabet-specific data and only load the needed modules.

This design, BTW, is a hybrid between C++'s and Python's attribute
lookup scheme.  It replaces C++'s type system with a PropertyManager,
the hierarchy traversal order is the same as Python's, and the
resolvers act like __of__ (from ExtensionClass) and __getattr__ hooks.

******

I really mean this only as a way to simplify access to data present
via normal means.  More precisely, to allow for default values for
certain parameters.

There are problems with the scheme.  It uses a global table to pass
information around, and I don't like global variables very much.  (I
rather prefer to have everything be specifiable as parameters.)

It's easy to abuse.

It doesn't really allow plugins.  There are hooks for using new
alphabets, but not for automatically adding new data to existing
alphabets.  The default action of the top-level class resolver (eg, for
`Alphabet') could search some os.environ defined path to find the
class+property name loader, but I've not put this to practice.

A proper naming scheme has to be introduced.  I'm thinking of "."
separated fields in a hierarchy.  Fields have the same syntax as
variables ([A-Za-z_][A-Za-z_0-9]*), with an emphasis on lower case.
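
A small check for such names might look like this.  The function name
is_valid_property_name is made up for the example, and the pattern is
one strict reading of "same syntax as variables".

```python
import re

# One field: a letter or underscore, then letters, digits or underscores.
_FIELD = r"[A-Za-z_][A-Za-z_0-9]*"

# A full name: one or more fields separated by single dots.
_NAME = re.compile(r"%s(?:\.%s)*" % (_FIELD, _FIELD))


def is_valid_property_name(name):
    """True if every "."-separated field looks like a variable name."""
    return _NAME.fullmatch(name) is not None


checks = {
    name: is_valid_property_name(name)
    for name in ("weight_table", "translator.id", "translator.id.1", "a..b")
}
```

Under this strict reading a name like "translator.id.1" is rejected,
because its last field starts with a digit.  Whether purely numeric
fields should be allowed is left open here.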

Where does data get placed in the hierarchy?

Should there be a way to get all objects for a certain pattern, eg,
to find all subtrees of "translator.id" ?
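
One possible shape for such a query, assuming the properties for a
class live in a plain dictionary keyed by dotted name.  The helper
find_subtree and the sample entries are hypothetical.

```python
def find_subtree(properties, prefix):
    """Return the entries whose dotted name lies strictly below prefix."""
    stem = prefix + "."
    return {name: obj for name, obj in properties.items()
            if name.startswith(stem)}


# Placeholder objects; only the names matter for this example.
properties = {
    "translator": "default translator",
    "translator.id.1": "translator for table 1",
    "translator.id.11": "translator for table 11",
    "translator.identity": "unrelated entry",
    "weight_table": "weights",
}

by_id = find_subtree(properties, "translator.id")
```

Matching on the prefix plus a trailing "." keeps "translator.identity"
out of the result, which a bare startswith("translator.id") would not.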

There's no thought given to looking up data with multiple alphabets.
Suppose you needed a nucleotideXprotein data object?  (First thought,
this is a double dispatch problem, so use chained resolvers, where the
nucleotide lookup returns a resolver to use for the protein lookup,
which returns the object.)
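
A rough sketch of that chained lookup, with stand-in alphabet classes
and a string in place of a real data object.  SecondStageResolver,
register_pair and resolve_pair are names invented for the example.

```python
class IUPACUnambiguousDNA:  # stand-ins for the real alphabet classes
    pass


class IUPACProtein:
    pass


class SecondStageResolver:
    """Returned by the first lookup; finishes the job for the second class."""

    def __init__(self):
        self.by_class = {}

    def resolve(self, klass):
        return self.by_class[klass]


# First stage: (first alphabet class, property name) -> second-stage resolver.
first_stage = {}


def register_pair(first_class, second_class, name, obj):
    key = (first_class, name)
    if key not in first_stage:
        first_stage[key] = SecondStageResolver()
    first_stage[key].by_class[second_class] = obj


def resolve_pair(first_class, second_class, name):
    inner = first_stage[(first_class, name)]  # dispatch on the first class
    return inner.resolve(second_class)        # then on the second


register_pair(IUPACUnambiguousDNA, IUPACProtein, "codon_table",
              "placeholder codon table")
found = resolve_pair(IUPACUnambiguousDNA, IUPACProtein, "codon_table")
```

The order of the two classes matters in this sketch; nothing is
registered for the (protein, nucleotide) direction.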

******

About two years ago, Jeff Chang and I were talking about automating
information resolution.  The idea is to allow a programmer to say

  "Give me the total weight of <this sequence>"

It might be that a "total weight" algorithm isn't known, but the
request looks like a function call, so a function call resolver can be
queried for

  "Do you know how to find the 'total weight' of <this sequence>?"

This could come back with a handle which implements the function call
semantics, that is, it returns `f', where `f(sequence)' is the total
weight.

It could even be that the function call resolver itself queries other
sites to get the function call (very much like DNS), and eventually
returns an 'f' which actually goes through CORBA to get the result.
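
In miniature, and entirely in-process (no DNS-like network queries and
no CORBA), the idea could be sketched as a resolver that answers
locally or forwards the question to a parent.  FunctionResolver and
its methods are hypothetical names, and the weights are placeholders.

```python
class FunctionResolver:
    """Answers "do you know how to find X?" with a callable or None."""

    def __init__(self, parent=None):
        self.parent = parent
        self.functions = {}

    def lookup(self, name):
        if name in self.functions:
            return self.functions[name]
        # Not known here: forward the question, as a DNS server would.
        if self.parent is not None:
            return self.parent.lookup(name)
        return None


PLACEHOLDER_WEIGHTS = {"A": 1.0, "G": 2.0}  # not real residue weights


def total_weight(sequence):
    return sum(PLACEHOLDER_WEIGHTS[c] for c in sequence)


root = FunctionResolver()
root.functions["total weight"] = total_weight
local = FunctionResolver(parent=root)

# "Do you know how to find the 'total weight' of <this sequence>?"
f = local.lookup("total weight")
result = f("AGA") if f is not None else None
```

The caller only ever sees `f`; where it came from, and what it does
internally to get the answer, stays hidden behind the lookup.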

In this viewpoint, the code in Bio.utils which actually uses the
default_manager can be considered as attributes of the "utils"
resolver.

This property manager code is my first attempt at experimenting with
using resolvers.  As such, I am deliberately trying to keep its use
simple and not required for biopython development.  Hence the emphasis
on keeping its use minimal.

A full-blown solution would also have things like cost, quality and
time of service, and optimizers for dealing with multiple possible
solvers.  That's all *ahem* scheduled for version 8.  :)

******

You know, I do believe this text is longer than the actual code by a
factor of 4 (12K vs. 3K).  I have a tendency to write a lot and fill
in the details.  Is this amount of email/detail too much?

                    Andrew Dalke
                    dalke@acm.org