[BioPython] proposal 3

Andrew Dalke dalke@acm.org
Thu, 30 Mar 2000 14:23:59 -0700


This is a short one.

  From what I can tell, a sequence is characterized by a list
of residues, which I'm encoding via a list of letters and an
alphabet description.

  Is that all that's needed for a minimal data structure?

  I can think of one more: are the end points physical end
points, or parts of a larger structure?  The characters in
a string are not exactly one-to-one equivalent to the residues.
Consider the carboxyl end of a protein.  Because it's a
terminus, it carries an extra "O-H", so its mass and atom
count will be higher than those of a residue with the same
letter in the middle of the sequence.

  So my proposal is:

Proposal 3 - The ends of a sequence may correspond to physical
ends of the real sequence.  This data is stored in the attribute
"endings", which has two elements, "left" and "right".  (Left
is position 0.)  The possible values for the elements are
UNKNOWN, TERMINAL, NONTERMINAL.
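
  As a minimal sketch, the attribute might look like this in
Python.  The class names Seq and Endings, the fields "data" and
"alphabet", and the use of strings for the three values are
assumptions for illustration; the proposal only fixes the name
"endings", its elements "left" and "right", and the three values.

```python
from dataclasses import dataclass, field

# The three values named in the proposal; representing them as
# strings is an assumption made for this sketch.
UNKNOWN = "UNKNOWN"
TERMINAL = "TERMINAL"
NONTERMINAL = "NONTERMINAL"


@dataclass
class Endings:
    left: str = UNKNOWN    # state of the end at position 0
    right: str = UNKNOWN   # state of the end at the last position


@dataclass
class Seq:
    data: str              # the letters of the sequence
    alphabet: str          # alphabet description (hypothetical field)
    endings: Endings = field(default_factory=Endings)


# A complete protein: both ends are physical termini.
whole = Seq("MKVL", "protein", Endings(TERMINAL, TERMINAL))
# A fragment whose ends were never recorded defaults to UNKNOWN.
fragment = Seq("KV", "protein")
```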

  The only results I can see it affecting are the atom count
and mass calculations, and only if the functions to calculate
the count and mass are accurate.

  Let me explain that.  An accurate mass calculation function
might look like:

def total_mass(seq, mass_table):
  mass = 0.0
  for c in seq:
    mass = mass + mass_table[c]
  return mass + 18.0  # the extra H and O-H for the terminals

In this case, with n = len(seq)//2 and t = mass_table,
  total_mass(seq, t) != total_mass(seq[:n], t) + total_mass(seq[n:], t)
because the 18.0 is added twice on the right-hand side.
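
  The inequality can be checked directly.  The sketch below
repeats the function so it runs on its own; the two residue
masses in the table are approximate values chosen only for
illustration, and the sequence "GAGA" is a made-up example.

```python
def total_mass(seq, mass_table):
    mass = 0.0
    for c in seq:
        mass = mass + mass_table[c]
    return mass + 18.0  # the extra H and O-H for the terminals


# Approximate residue masses, for illustration only.
table = {"G": 57.05, "A": 71.08}
seq = "GAGA"
n = len(seq) // 2

whole = total_mass(seq, table)
parts = total_mass(seq[:n], table) + total_mass(seq[n:], table)

# Each half gets its own 18.0, so the halves overshoot by one 18.0.
difference = parts - whole
```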

  Thus, with such a function, slicing corresponds to a physical
cut of the protein rather than to a subsection of the string.
It doesn't let you answer "what is the mass contribution of the
first half of the sequence?"
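
  With the endings tracked, a mass function could answer that
question.  The following is only a sketch under assumptions: the
name mass_with_endings is hypothetical, left is taken to be the
amino end (extra H) and right the carboxyl end (extra O-H), the
masses 1.0 and 17.0 are rounded so they sum to the 18.0 used
above, and any value other than TERMINAL adds nothing.

```python
import math

TERMINAL = "TERMINAL"
NONTERMINAL = "NONTERMINAL"

H_MASS = 1.0    # rounded mass of the extra H (assumed amino end)
OH_MASS = 17.0  # rounded mass of the extra O-H (carboxyl end)


def mass_with_endings(seq, mass_table, left, right):
    """Sum residue masses and add end groups only at physical termini."""
    mass = sum(mass_table[c] for c in seq)
    if left == TERMINAL:
        mass += H_MASS
    if right == TERMINAL:
        mass += OH_MASS
    return mass


# Approximate residue masses, for illustration only.
table = {"G": 57.05, "A": 71.08}
seq = "GAGA"
n = len(seq) // 2

whole = mass_with_endings(seq, table, TERMINAL, TERMINAL)
first = mass_with_endings(seq[:n], table, TERMINAL, NONTERMINAL)
second = mass_with_endings(seq[n:], table, NONTERMINAL, TERMINAL)

# The two contributions now add up to the mass of the whole.
additive = math.isclose(whole, first + second)
```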

  This information will only be used rarely (as evidence, biojava
and bioperl don't track this data).  Adding it means that every
constructor, factory function, and generative function must
do the right thing for the endings.  There are three possible
values for each end, which makes things complex.  That code will
run often, but since the data is almost never used, the work
will be wasted.

  There would also need to be a function other than subslicing
used to modify the ending.  E.g.,
   seq.cut(10, 40, chop=(TERMINAL, UNKNOWN))
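
  One way such a function could behave is sketched here.  The
Seq class, the tuple form of "endings", and the default value of
chop are all assumptions for illustration; only the call shape
cut(start, end, chop=...) comes from the example above.

```python
UNKNOWN = "UNKNOWN"
TERMINAL = "TERMINAL"
NONTERMINAL = "NONTERMINAL"


class Seq:
    def __init__(self, data, endings=(UNKNOWN, UNKNOWN)):
        self.data = data
        self.endings = endings  # (left, right)

    def cut(self, start, end, chop=(NONTERMINAL, NONTERMINAL)):
        """Return data[start:end] as a new Seq with endings taken from chop."""
        return Seq(self.data[start:end], endings=chop)


# A made-up 50-residue protein with two physical termini.
protein = Seq("M" * 50, endings=(TERMINAL, TERMINAL))
piece = protein.cut(10, 40, chop=(TERMINAL, UNKNOWN))
```

  The original sequence is left unchanged; the caller has to say
what each new end means, which is the extra burden described above.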

  Since I don't like the complexity and performance hits,
I'm against the proposal.

                    Andrew
                    dalke@acm.org