Bioperl: "lightweight analyses"

Andrew Dalke dalke@bioreason.com
Sun, 07 Mar 1999 18:54:58 -0800


This is a multi-part message in MIME format.
--------------575712A676C9739C0C568F4A
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

I was looking at:
  http://bio.perl.org/Projects/Sequence/

where it says:
> Desirable lightweight analyses not yet written 
> 
>   Property analysis (MW, pI, etc.) 
>   Metric "sliding window" properties (GC-content, hydrophobicity, etc.)

For some strange reason, I was in the mood to write some perl today,
so I implemented at least the algorithms for doing these.  I don't
do much with the bioperl code itself, so someone else will have to
look into integrating this into the existing framework.  Also, as
I don't do much perl programming these days, that person should also
check to make sure I did things the modern way.

I've attached the code to this email.  It's only a few routines
along with some test scripts and examples.  With it, the calculation
of the approx. molecular weight is:

   $w = &sum(&map_properties($seq, \%weights));

(of course, you need a list of molecular weights to do this mapping).
The code is similar for mapping any sort of property onto a residue;
though I make the assumption that all inputs will be a string of
residues in their single character abbreviation.

I also have a sliding window code.  It is of the form:

  window($size, $ref_to_list_of_values [, $ref_to_function])

where the ref_to_function is a reference to a function which takes
a list and returns the computed value for that window.  If it
isn't given, the default is to compute an average.

Tieing them together gives a quick way to compute the windowed
average of hydrophobicity values:

@hydrophobicity = &map_properties($protein, \%scale_B);
@avg_hydro = &window(5, \@hydrophobicity);

and if you want to do a weighted triangle average over 3 terms you
could do:

sub tri3 {
  return ($_[0] + 2*$_[1] + $_[2]) / 4;
}
@avg_hydro = &window(1, \@hydrophobicity, \&tri3);

I *think* GC content would be:

GC = ("G" => 1, "C" => 1, "A" => 0, "T" => 0);
@gc_content = &map_properties($nt, \%GC);
$total_gc_content = &find_average(@gc_content);
$windowed_gc_content = &window(3, @gc_content);

I say *think* because I'm more of a protein person and I'm not sure
what "GC content" means; I assume it's the number of G and C bases
over the total number of bases in the region considered.  I also
say it because I haven't tested this example code.

Legal notice:  this code may be distributed under whatever the
standard bioperl license is.  If there isn't a standard license, it
may be be distributed under terms equal to the Python license (with
the appropriate replacement of names to Andrew Dalke and Bioreason,
Inc.).

						Andrew
						dalke@bioreason.com
--------------575712A676C9739C0C568F4A
Content-Type: application/x-perl; name="seqmap.pl"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="seqmap.pl"

#!/usr/local/bin/perl -w


sub map_properties {
  my($seq, $props) = @_;

  my(%props) = %$props;       # turn ref to map into map
  @seq = split('', $seq);     # turn string into list of characters

  return map($props{$_}, @seq);  # map the properties onto the residues
}

# Simple sum followed by divide.
# Hopefully you won't go past maxint :)
sub find_average {
  my(@l) = @_;
  my($sum) = 0.0;
  foreach (@l) {
    $sum += $_;
  }
  return $sum / ($#l+1);
}

# This *has* to be an existing perl function, but I don't
# know what it is.
sub sum {
  my($sum);
  foreach (@_) {
    $sum += $_;
  }
  return $sum;
}
# Go over a list (reference with a sliding window.  For each window
# position,  compute some value (if the function to compute the value
# is not given, compute the average).
# Return the list of computed window values.  This list will be
# 2*$size shorter than the original list because of edge effects.
# The effective window size is 2*$size+1, so
#   if $size == 0, $window_size == 1
#   if $size == 2, $window_size == 5,
# etc.
sub window {
  my($size, $list, $func) = @_;
  my($window_size) = 2*$size+1;
  my(@list) = @$list;
  my($i, @ret);
  
  # If there is a user defined function
  if (defined $func) {
    my($end) = $#list - $window_size + 1;
    # pass window sized slices to the function for farther evaluation
    for ($i=0; $i<=$end; $i++) {
      # and save the results
      push(@ret, &$func(@list[$i, $i+$window_size-1]))
    }
  } else {
    # compute the average directly
    # accumulates the sum instead of recomputing all window sums
    my($sum) = 0.0;
    # Get the initial window sum (sans last term)
    for ($i=0; $i<$window_size-1; $i++) {
      $sum += $list[$i];
    }
    # go over the rest of the list, adding and subtracting as needed
    for (; $i<=$#list; $i++) {
      $sum += $list[$i];
      push(@ret, $sum / $window_size);
      $sum -= $list[$i-$window_size];
    }
  }
  return @ret;
}


## Some tests of the above
%test_hash = (
	      'A' => 0,
	      'T' => 1,
	      'C' => 3,
	      'G' => 8,
	      );

$seq = 'AATCGAGCACAACTATGCGCTACATGATGACTGCGAACT';

@p = map_properties($seq, \%test_hash);

for ($i=0; $i<length($seq); $i++) {
  if ($p[$i] != $test_hash{substr($seq,$i, 1)}) {
    print "Problem at $i\n";
  }
}

# check to see that the length is the summed weight
%unit_weight = (
	      'A' => 1,
	      'T' => 1,
	      'C' => 1,
	      'G' => 1,
	      );
$w = &sum(&map_properties($seq, \%unit_weight));
if (length($seq) != $w) {
  print "Weights are different: got $w instead of ";
  print length($seq);
  print "\n";
}


# List and the average of the list
@test_set =  ([[1, 1, 1, 1, 1], 1],
	      [[0, 1, 2, 3, 4], 2],
	      [[0], 0],
	      [[-3], -3],
	      [[1, 2], 1.5],
	      );
foreach $test (@test_set) {
  $l = (@$test)[0];
  $v = (@$test)[1];
  $x = &find_average(@$l);
  if ($x != $v) {
    print "Not the same for @$l: $x != $v\n";
  }    
}

# A window size of 1 with averaging shouldn't change the list
foreach $test (@test_set) {
  $l = (@$test)[0];
  $v = (@$test)[1];
  @x = &window(0, $l);
  if (@x != @$l) {
    print "Not the same: @$l => @x\n";
  }
}

# Now use a function to do the averaging
foreach $test (@test_set) {
  $l = (@$test)[0];
  $v = (@$test)[1];
  @x = &window(0, $l, \&find_average);
  if (@x != @$l) {
    print "Not the same with average function: @$l => @x\n";
  }
}

@test_set2 = ([[1, 1, 1], [1]],
	      [[1, 1, 1, 1], [1, 1]],
	      [[1, 1, 1, 1, 1], [1, 1, 1]],
	      [[1, 2, 3, 4, 5], [2, 3, 4]],
	      [[1, 2, 4, 8], [7/3, 14/3]],
	      );

foreach $test (@test_set2) {
  $l = (@$test)[0];
  $v = (@$test)[1];
  @x = &window(1, $l);
  if (@x != @$v) {
    print "why are these different? @$v and @x\n";
  }
}

# And now, some examples


# Don't have a good list handy, so grabbed the data from:
#    http://selene.biochem.uga.edu/tutorial/aaprops.html
# Of course, a real program should deal with the ends properly, but
# I don't have that data (so don't trust the result to 5 sig figs!)
# This table is in Daltons.
%weights = (
	    'A' => 71.09,
	    'C' => 103.15,
	    'D' => 114.08,
	    'E' => 128.11,
	    'F' => 147.19,
	    'G' => 57.07,
	    'H' => 138.16,
	    'I' => 113.18,
	    'K' => 129.19,
	    'L' => 113.18,
	    'M' => 131.21,
	    'N' => 114.1,
	    'P' => 97.13,
	    'Q' => 128.14,
	    'R' => 157.2,
	    'S' => 87.09,
	    'T' => 101.12,
	    'V' => 99.15,
	    'W' => 186.23,
	    'Y' => 163.12,
	   );

# The following comes from Branden and Tooze, p210 who go to reference

# K.Kyte and R.F.Doolittle
%scale_A = (
	    'F' => 2.8,
	    'M' => 1.9,
	    'I' => 4.5,
	    'L' => 3.8,
	    'V' => 4.2,
	    'C' => 2.5,
	    'W' => -0.9,
	    'A' => 1.8,
	    'T' => -0.7,
	    'G' => -0.4,
	    'S' => -0.8,
	    'P' => -1.6,
	    'Y' => -1.3,
	    'H' => -3.2,
	    'Q' => -3.5,
	    'N' => -3.5,
	    'E' => -3.5,
	    'K' => -3.9,
	    'D' => -3.5,
	    'R' => -4.5,
	   );

# D.A.Engelman, T.A.Steitz, and A. Goldman
%scale_B = (
	    'F' => 3.7,
	    'M' => 3.4,
	    'I' => 3.1,
	    'L' => 2.8,
	    'V' => 2.6,
	    'C' => 2.0,
	    'W' => 1.9,
	    'A' => 1.6,
	    'T' => 1.2,
	    'G' => 1.0,
	    'S' => 0.6,
	    'P' => -0.2,
	    'Y' => -0.7,
	    'H' => -3.0,
	    'Q' => -4.1,
	    'N' => -4.8,
	    'E' => -8.2,
	    'K' => -8.8,
	    'D' => -9.2,
	    'R' => -12.3,
	   );

sub pretty_plot {
  my($min, $max, $width, @vals) = @_;
  my($pos);
  my($scale) = $width / ($max - $min);
  print "      " . "-" x $width . "\n";
  my($i) = 0;
  foreach (@vals) {
    $pos = int(($_ - $min) * $scale);
    print sprintf("%5d ", $i) . ' ' x ($pos-1) . '*' . "\n";
    $i++;
  }
  print "      " . "-" x $width . "\n";
}

# Picked
#  gi|2624664|pdb|1AIJ|M Chain M, Photosynthetic Reaction Center From
#  Rhodobacter Sphaeroides In The Charge-Neutral Dqaqb State
$protein = <<EOM;
AEYQNIFSQVQVRGPADLGMTEDVNLANRSGVGPFSTLLGWFGNAQLGPIYLGSLGVLSLFSGLMWFFTI
GIWFWYQAGWNPAVFLRDLFFFSLEPPAPEYGLSFAAPLKEGGLWLIASFFMFVAVWSWWGRTYLRAQAL
GMGKHTAWAFLSAIWLWMVLGFIRPILMGSWSEAVPYGIFSHLDWTNNFSLVHGNLFYNPFHGLSIAFLY
GSALLFAMHGATILAVSRFGGERELEQIADRGTAAERAALFWRWTMGFNATMEGIHRWAIWMAVLVTLTG
GIGILLSGTVVDNWYVWGQNHGMAPLN
EOM
$protein =~ s/[^A-Z]//g;

@hydrophobicity = &map_properties($protein, \%scale_B);

print "Raw values\n";
&pretty_plot(-12.5, 4, 70, @hydrophobicity);

print "\n\nSmoothed values\n";
@avg_hydro = &window(5, \@hydrophobicity);
&pretty_plot(-12.5, 4, 70, @avg_hydro);


$w = &sum(&map_properties($seq, \%weights));
print "The molecular weight of that protein is: $w\n";


--------------575712A676C9739C0C568F4A--

=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================