[Bioperl-l] Protein Isoelectric Point calculation

Southern, Mark R mark_southern@merck.com
Wed, 11 Sep 2002 17:51:44 -0400


A full four cents worth. Here's some more from me :-)

Firstly; my feeling is that a C implementation is definitely the way to go.
Over a great number of sequences, pure perl implementations, or spawning
sub-processes ( such as to emboss's iep ) are going to be too slow to be
practicable ( I get around this at the moment by massively parallelizing the
pI calculations ). Given this it would be great to have an Inline::C or xs
version that reuses the implementation from iep or the expasy compute pI/Mw
tool in some way. This because I for one wouldn't want to write the C code
;-) and more seriously because users are going to have more confidence in
these algorithms as they already have good visibility. That said see below.

Secondly; what functionality should we provide? The isoelectric point yes,
but also the charge at a particular pH would be nice. The ability to
(dynamically) specify the calculations to a given number of decimal places
is useful and would save useless iterations, and since the pK values are
experimentatal ones then the ability to reset these might not go amiss, or
am i unduly complicating things?

Thirdly; caveat programmer. I dug the expasy code out of the mailing list
archive and have had a terrible time getting it to run on irix or tru64. The
exp10 function could not be found in the standard math library so i
converted from exp10(x) to pow(10.0,x). I then had a problem that i traced
to the following lines in the pIcalc routine;


  i = 0;
  while (c = sequence[i++]) {
    comp[toupper(c) - 'A']++;
  }

// and then later ...

  cTermResidue = toupper(sequence[i - 1]) - 'A';


The problem being that the variable 'i' is actually +2 past the end of the
sequence[] array by the time the while loop is completed and thus the value
of 'cTermResidue' is not correct, or even a residue. On linux the code runs
(producing the wrong result) but on irix or tru64 the result is a
segmentation fault.


Mark.

-----Original Message-----
From: Aaron J Mackey [mailto:ajm6q@virginia.edu]
Sent: Wednesday, September 11, 2002 7:34 AM
To: Southern, Mark R
Cc: 'bioperl-l@bioperl.org'
Subject: Re: [Bioperl-l] Protein Isoelectric Point calculation



Looks great.  I have just a couple of questions about the subject, not
necessarily about your implementation in particular ...

Previously, when a request for a pI tool came up, I've posted some
Inline::C code (which is simply a wrapper around the C code sent to me by
the Expasy folk, that underlies their pI tool).  I've found that due to
the nature of the iterative algorithm, that the C version is significantly
faster (I'll run off and benchmark Mark's vs. my C version and report
back).  So question #1 is whether there is any interest for a parallel C
implementation (to live in bioperl-ext I guess) that would take the place
of a pure-perl version?  I would strive to make it
feature/interface-identical to Mark's if there were much interest in this
(since we use it in a database-loading script, it really is a time saver
when importing nearly a million sequences).

Secondly, for historical interest, you may want to check out:

Ribeiro JM. Sillero A. An algorithm for the computer calculation of the
coefficients of a polynomial that allows determination of isoelectric
points of proteins and other macromolecules. [Journal Article] Computers
in Biology & Medicine. 20(4):235-42, 1990

Ribeiro JM. Sillero A. A program to calculate the isoelectric point of
macromolecules. [Journal Article] Computers in Biology & Medicine.
21(3):131-41, 1991

Sillero A. Ribeiro JM. Isoelectric points of proteins: theoretical
determination. [Journal Article] Analytical Biochemistry. 179(2):319-25,
1989 Jun

All of these describe non-iterative, exact (or nearly exact heuristic)
algorithms for calculating a pI given a macromolecule with more than 3
classes of pKa's (since for 3 or less classes, a closed form solution
exists; you may want to check for those cases and solve exactly rather
than taking the time to iterate).

You ask what all the trouble is?  It turns out that in proteomics, at
least, having a database annotated with molecular weight and pI allows you
to run database searches in "windows" of MW and/or pI, thus shrinking your
search size and improving the expecation of any hit.  So good, fast pI
algorithms are important ...

Just my usual 4 cents ;)

-Aaron

On Fri, 6 Sep 2002, Southern, Mark R wrote:

> See POD documentation below ...
> This is something I discussed with Jason some time back but has taken a
> while to go through disclosure. I had a need for a protein isoelectric
point
> calculation. There are programs out there but I wanted to understand how
it
> worked and came across a very decent description of an algorithm (see
> acknowledgments) . Depending on the pK values chosen it compares well to
> other options.
> I thought that this code, in modified form might fit into the
> Bio::Tools::SeqStats area.
> Best wishes,
> Mark.
> Mark Southern
> Merck & Co. Inc.
> Dept. of Bioinformatics
> RY80-A1
> PO Box 2000
> Rahway NJ 07065
> * (732) 594-3777
> * (732) 594-2929
>  <<pICalculator.pm>>
> NAME
> pICalculator
> DESCRIPTION
> Calculates the isoelectric point of a protein, the pH at which there is no
> overall charge on the protein.
> Calculates the charge on a protein at a given pH.
> SYNOPSIS
> use pICalculator;
> use Bio::SeqIO;
> my $in = Bio::SeqIO->new( -fh => \*STDIN ,-format => 'Fasta' );
> my $calc = pICalculator->new(-places => 2);
> while ( my $seq = $in->next_seq ) {
>     $calc->seq( $seq );
>
>     my $iep = $calc->iep;
>
>     print sprintf( "%s\t%s\t%.2f\n"
>                   ,$seq->id
>                   ,$iep
>                   ,$calc->charge_at_pH($iep)
>                 );
>     for( my $i = 0; $i <= 14; $i += 0.5 ){
>             print sprintf( "\tpH = %.2f\tCharge = %.2f\n"
>                       ,$i
>                       ,$calc->charge_at_pH($i)
>                      );
>     }
> }
> CONSTRUCTOR
> new
> $calc = pICalculator->new( [ [ -pKset => \%pKvalues ] [ -pKset =>
> 'valid_string' ] ] ,-seq => $seq # Bio::Seq ,-places => 2 );
> Constructs a new pICalculator. Arguments are a flattened hash. Valid keys
> are;
> -pKset A reference to a hash with key value pairs for the pK values of the
> charged amino acids. Required keys are;
>
>                  N_term   C_term   K   R   H   D   E   C   Y
> -pKset A valid string ( 'DTASelect' or 'EMBOSS' ) that will specify an
> internal set of pK values to be used. The default is 'EMBOSS'
> -seq A Bio::Seq sequence to analyze
> -places The number of decimal places to use in the isoelectric point
> calculation. The default is 2.
> METHODS
> iep $calc->iep
> Calculates the isoelectric point of the given protein. The value is
> returned.
> charge_at_pH $calc->charge_at_pH(7)
> Calculates the charge on the protein at a given pH.
> seq $seq = $calc->seq( $seq )
> Sets or returns the Bio::Seq used in the calculation.
> pKSet $pkSet = $calc->pKSet( \%pKSet );
> Sets or returns the hash of pK values used in the calculation.
> SEE ALSO
> <http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf>
> <http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/iep.html>
> <http://us.expasy.org/tools/pi_tool.html>
> BUGS
> Please report them!
> LIMITATIONS
> There are various sources for the pK values of the amino acids. The set of
> pK values chosen will affect the pI reported.
> The charge state of each residue is assumed to be independent of the
others.
> Protein modifications (such as a phosphate group) that have a charge are
> ignored
> AUTHOR
> Mark Southern (mark_southern@merck.com <mailto:mark_southern@merck.com>)
> >From an algorithm by David Tabb found at
> <http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf>
> COPYRIGHT
> Copyright (c) 2002, Merck & Co. Inc. All Rights Reserved. This module is
> free software. It may be used, redistributed and/or modified under the
terms
> of the Perl Artistic License (see
> <http://www.perl.com/perl/misc/Artistic.html)>
>
>
>
----------------------------------------------------------------------------
--
> Notice:  This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that
may be confidential, proprietary copyrighted and/or legally privileged, and
is intended solely for the use of the individual or entity named in this
message.  If you are not the intended recipient, and have received this
message in error, please immediately return this by e-mail and then delete
it.
>
>
============================================================================
==
>

-- 
 Aaron J Mackey
 Pearson Laboratory
 University of Virginia
 (434) 924-2821
 amackey@virginia.edu




------------------------------------------------------------------------------
Notice:  This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that may be confidential, proprietary copyrighted and/or legally privileged, and is intended solely for the use of the individual or entity named in this message.  If you are not the intended recipient, and have received this message in error, please immediately return this by e-mail and then delete it.

==============================================================================