Bioperl: Bioperl SEQ functions

Andrew Dalke dalke@bioreason.com
Wed, 30 Jun 1999 22:05:06 -0600


Aaron J Mackey <ajm6q@virginia.edu> suggested sample code
of the form:
> print Bio::Tools::Hydrophobicity->kyte_doolittle($seq); # yee haw!

Following is some software philosophy.  I am against this
style of interface because it requires the underlying code
to have too much duplication.  Here's why, and a different
(IMO better :) solution.

Suppose you have two scales (the post I made last March also
gave the Engelman, Steitz and Goldman values).  By extension you
would compute the second one with something like:
> print Bio::Tools::Hydrophobicity->esg($seq);

The direct implementation of this code is (in pseudocode):

%kyte_doolittle_scale = ( 'A' => .... );
%esg_scale = ( 'A' => .... );

sub kyte_doolittle (takes a $seq) {
  total = 0
  foreach residue in seq:
    total += kyte_doolittle_scale[residue]
  return total / length(seq)
}

sub esg (takes a $seq) {
  total = 0
  foreach residue in seq:
    total += esg_scale[residue]
  return total / length(seq)
}

In other words, the two functions are almost duplicates except for
the name of the hash to use.

An implementation with less duplication would be:

sub compute (takes $seq and %values) {
  total = 0
  foreach residue in seq:
    total += values[residue]
  return total / length(seq)
}
sub kyte_doolittle (takes a $seq) {
  return compute($seq, \%kyte_doolittle_scale)
}
sub esg (takes a $seq) {
  return compute($seq, \%esg_scale)
}
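
For concreteness, the pseudocode above might look like this in actual
Perl.  This is only a sketch: the Kyte-Doolittle table is cut down to
five residues, and the numbers in %esg_scale are placeholders, not the
published Engelman, Steitz and Goldman values.

```perl
use strict;
use warnings;

# Kyte-Doolittle values for five residues only; a full table would
# cover all twenty amino acids.
my %kyte_doolittle_scale = (A => 1.8, I => 4.5, L => 3.8, K => -3.9, R => -4.5);

# Placeholder numbers, NOT the published Engelman-Steitz-Goldman values.
my %esg_scale = (A => 1.0, I => 2.0, L => 3.0, K => -1.0, R => -2.0);

# Mean scale value over the residues of $seq.
# $values is a reference to a residue => value hash.
sub compute {
    my ($seq, $values) = @_;
    my $total = 0;
    $total += $values->{$_} for split //, $seq;
    return $total / length($seq);
}

# Thin wrappers: the only thing that differs is the table handed over.
sub kyte_doolittle { my ($seq) = @_; return compute($seq, \%kyte_doolittle_scale); }
sub esg            { my ($seq) = @_; return compute($seq, \%esg_scale); }

printf "%.2f\n", kyte_doolittle("AIL");
printf "%.2f\n", esg("AIL");
```

The sketch does no input checking: an empty sequence divides by zero,
and a residue missing from the table triggers an "uninitialized value"
warning.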



With this implementation, a problem will arise if you want to
test different hydrophobicity scales.  To preserve consistency,
each new scale requires a new function to be written.  Otherwise,
the code will have some functions calling kyte_doolittle directly
while others reference a table hash and the "compute" function.

There is another problem if you consider the broader range of where
you might compute the hydrophobicity; eg, suppose you also want
to see it windowed over the whole sequence.  For consistency in
this case you'll also have to write "windowed_kyte_doolittle"
and "windowed_esg" and ....  To top it off, you'll need something
to convert a selection string (say, a pull-down menu from the
viewer) to the different functions.  It would look like:

  if ($selection eq "kyte_doolittle") {
    $hydrophobicity_func = \&kyte_doolittle;
    $windowed_func = \&windowed_kyte_doolittle;
  } elsif ($selection eq "esg") {
    $hydrophobicity_func = \&esg;
    $windowed_func = \&windowed_esg;
  } ...


Instead, I believe the better solution is to pass around the
lookup table to the generic compute functions, so that your code:

> print Bio::Tools::Hydrophobicity->kyte_doolittle($seq); # yee haw!

can be replaced by:

> print Bio::Tools::Hydrophobicity->compute($seq);

using kyte_doolittle as the default; or if you want a different
table, then
> print Bio::Tools::Hydrophobicity->compute($seq,
>      \%Bio::Data::esg_hydrophobicity)


In addition, the code to convert a user selection into the appropriate
scale will be:

  if ($selection eq "kyte_doolittle") {
    $hydrophobicity_scale = \%kyte_doolittle_scale;
  } elsif ($selection eq "esg") {
    $hydrophobicity_scale = \%esg_scale;
  } ...

and will not change if you add more types of functions which
depend on the data values being used.
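
As a rough illustration of that last point, here is a sketch in which
the selection is resolved once and the resulting table is handed to two
different computations.  It swaps the if/elsif chain for a lookup hash,
which is a substitution of mine rather than anything in the code above.
The names %scale_for, mean_value and peak_value are made up for the
example, the Kyte-Doolittle table is partial, and the esg numbers are
placeholders.

```perl
use strict;
use warnings;
use List::Util qw(max);

my %kyte_doolittle_scale = (A => 1.8, I => 4.5, L => 3.8);  # partial table
my %esg_scale            = (A => 1.0, I => 2.0, L => 3.0);  # placeholder values

# The one place where a selection string becomes a table.
my %scale_for = (
    kyte_doolittle => \%kyte_doolittle_scale,
    esg            => \%esg_scale,
);

# Two unrelated computations that both just take a table reference.
sub mean_value {
    my ($seq, $scale) = @_;
    my $total = 0;
    $total += $scale->{$_} for split //, $seq;
    return $total / length($seq);
}

sub peak_value {
    my ($seq, $scale) = @_;
    return max(map { $scale->{$_} } split //, $seq);
}

my $selection = "esg";    # e.g. the value of a pull-down menu
my $scale = $scale_for{$selection}
    or die "unknown scale '$selection'\n";

print mean_value("AIL", $scale), "\n";
print peak_value("AIL", $scale), "\n";
```

Adding a third computation means writing one more sub that takes a
table; %scale_for and the selection code stay as they are.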


Since the code is smaller, more generalized and easier to use but
with very little increase in complexity, I consider this a much
better style.
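
To make the comparison concrete, here is a sketch of what the generic
interface could look like as class methods, with Kyte-Doolittle as the
default table and a windowed variant that takes the same optional
table.  My::Hydrophobicity is a hypothetical package invented for this
example, not an existing Bioperl module; the Kyte-Doolittle table is
partial and %other_scale holds placeholder numbers.

```perl
use strict;
use warnings;

package My::Hydrophobicity;   # hypothetical package, not a Bioperl module

# Partial Kyte-Doolittle table (five residues only, for brevity).
our %kyte_doolittle = (A => 1.8, I => 4.5, L => 3.8, K => -3.9, R => -4.5);

# Class method: mean scale value over $seq.  Falls back to the
# Kyte-Doolittle table when no table reference is supplied.
sub compute {
    my ($class, $seq, $scale) = @_;
    $scale ||= \%kyte_doolittle;
    my $total = 0;
    $total += $scale->{$_} for split //, $seq;
    return $total / length($seq);
}

# Class method: one mean per window of $width residues, sliding by one.
# Returns an empty list when the sequence is shorter than the window.
sub windowed {
    my ($class, $seq, $width, $scale) = @_;
    return map { $class->compute(substr($seq, $_, $width), $scale) }
           0 .. length($seq) - $width;
}

package main;

my %other_scale = (A => 1.0, I => 2.0, L => 3.0, K => -1.0, R => -2.0);  # placeholder values

printf "%.2f\n", My::Hydrophobicity->compute("AILK");
printf "%.2f\n", My::Hydrophobicity->compute("AILK", \%other_scale);
print join(" ", map { sprintf "%.2f", $_ } My::Hydrophobicity->windowed("AILK", 3)), "\n";
```

Note that windowed needed no per-scale variant: it passes along
whatever table it was given.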


  Here's a similar example where passing a hash also makes more
sense:

sub compute (takes a $seq and a $scale, which is a string) {
  if ($scale eq "kyte_doolittle") {
    %values = %kyte_doolittle_scale;
  } elsif ($scale eq "esg") {
    %values = %esg_scale;
  } ...
  total = 0
  foreach residue in seq:
    total += values[residue]
  return total / length(seq)
}

  I saw some code like this a few weeks ago.  The problem here
is that everything which uses this interface (where the string
name is passed around instead of the hash map) has its own routine
to do the conversion -- too much duplication.  Actually, the code
I saw needed to do two conversions, one from the command line
option to the input string (eg, convert "-kd" into "kyte_doolittle")
and the other from the input string to the hash values (eg,
convert "kyte_doolittle" to "%kyte_doolittle_scale").  By passing
a hash around, only one translation is needed, from "-kd" to
"\%kyte_doolittle_scale".  Additionally, the computation code wouldn't
need to check for a misspelled name and raise an error.
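
A sketch of that single translation step follows.  The "-kd" flag is
from the example above; the "-esg" flag, the %scale_for_option hash and
the scale_from_option routine are names assumed for illustration.  The
Kyte-Doolittle table is partial and the esg numbers are placeholders.

```perl
use strict;
use warnings;

my %kyte_doolittle_scale = (A => 1.8, I => 4.5, L => 3.8);  # partial table
my %esg_scale            = (A => 1.0, I => 2.0, L => 3.0);  # placeholder values

# Single translation: option flag straight to a table reference.
my %scale_for_option = (
    '-kd'  => \%kyte_doolittle_scale,
    '-esg' => \%esg_scale,            # assumed flag name
);

# The only place that can see a bad name, so the only place that
# has to complain about one.
sub scale_from_option {
    my ($option) = @_;
    my $scale = $scale_for_option{$option}
        or die "unknown scale option '$option'\n";
    return $scale;
}

# No name checking in here: it is handed a table and uses it.
sub compute {
    my ($seq, $values) = @_;
    my $total = 0;
    $total += $values->{$_} for split //, $seq;
    return $total / length($seq);
}

my $option = '-kd';    # would normally come from the command line
print compute("AIL", scale_from_option($option)), "\n";
```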


  The general principle is: make decisions as early as you can,
rather than embedding them deep in the code.  This simplifies the
code and, if you ever have to do code coverage testing, makes
developing your test cases a lot simpler.

						Andrew Dalke
						dalke@bioreason.com