[Bioperl-l] Protein Isoelectric Point calculation

Aaron J Mackey Aaron J. Mackey" <amackey@virginia.edu
Wed, 11 Sep 2002 07:34:23 -0400 (EDT)


Looks great.  I have just a couple of questions about the subject, not
necessarily about your implementation in particular ...

Previously, when a request for a pI tool came up, I've posted some
Inline::C code (which is simply a wrapper around the C code sent to me by
the Expasy folk, that underlies their pI tool).  I've found that due to
the nature of the iterative algorithm, that the C version is significantly
faster (I'll run off and benchmark Mark's vs. my C version and report
back).  So question #1 is whether there is any interest for a parallel C
implementation (to live in bioperl-ext I guess) that would take the place
of a pure-perl version?  I would strive to make it
feature/interface-identical to Mark's if there were much interest in this
(since we use it in a database-loading script, it really is a time saver
when importing nearly a million sequences).

Secondly, for historical interest, you may want to check out:

Ribeiro JM. Sillero A. An algorithm for the computer calculation of the
coefficients of a polynomial that allows determination of isoelectric
points of proteins and other macromolecules. [Journal Article] Computers
in Biology & Medicine. 20(4):235-42, 1990

Ribeiro JM. Sillero A. A program to calculate the isoelectric point of
macromolecules. [Journal Article] Computers in Biology & Medicine.
21(3):131-41, 1991

Sillero A. Ribeiro JM. Isoelectric points of proteins: theoretical
determination. [Journal Article] Analytical Biochemistry. 179(2):319-25,
1989 Jun

All of these describe non-iterative, exact (or nearly exact heuristic)
algorithms for calculating a pI given a macromolecule with more than 3
classes of pKa's (since for 3 or less classes, a closed form solution
exists; you may want to check for those cases and solve exactly rather
than taking the time to iterate).

You ask what all the trouble is?  It turns out that in proteomics, at
least, having a database annotated with molecular weight and pI allows you
to run database searches in "windows" of MW and/or pI, thus shrinking your
search size and improving the expecation of any hit.  So good, fast pI
algorithms are important ...

Just my usual 4 cents ;)

-Aaron

On Fri, 6 Sep 2002, Southern, Mark R wrote:

> See POD documentation below ...
> This is something I discussed with Jason some time back but has taken a
> while to go through disclosure. I had a need for a protein isoelectric point
> calculation. There are programs out there but I wanted to understand how it
> worked and came across a very decent description of an algorithm (see
> acknowledgments) . Depending on the pK values chosen it compares well to
> other options.
> I thought that this code, in modified form might fit into the
> Bio::Tools::SeqStats area.
> Best wishes,
> Mark.
> Mark Southern
> Merck & Co. Inc.
> Dept. of Bioinformatics
> RY80-A1
> PO Box 2000
> Rahway NJ 07065
> * (732) 594-3777
> * (732) 594-2929
>  <<pICalculator.pm>>
> NAME
> pICalculator
> DESCRIPTION
> Calculates the isoelectric point of a protein, the pH at which there is no
> overall charge on the protein.
> Calculates the charge on a protein at a given pH.
> SYNOPSIS
> use pICalculator;
> use Bio::SeqIO;
> my $in = Bio::SeqIO->new( -fh => \*STDIN ,-format => 'Fasta' );
> my $calc = pICalculator->new(-places => 2);
> while ( my $seq = $in->next_seq ) {
>     $calc->seq( $seq );
>
>     my $iep = $calc->iep;
>
>     print sprintf( "%s\t%s\t%.2f\n"
>                   ,$seq->id
>                   ,$iep
>                   ,$calc->charge_at_pH($iep)
>                 );
>     for( my $i = 0; $i <= 14; $i += 0.5 ){
>             print sprintf( "\tpH = %.2f\tCharge = %.2f\n"
>                       ,$i
>                       ,$calc->charge_at_pH($i)
>                      );
>     }
> }
> CONSTRUCTOR
> new
> $calc = pICalculator->new( [ [ -pKset => \%pKvalues ] [ -pKset =>
> 'valid_string' ] ] ,-seq => $seq # Bio::Seq ,-places => 2 );
> Constructs a new pICalculator. Arguments are a flattened hash. Valid keys
> are;
> -pKset A reference to a hash with key value pairs for the pK values of the
> charged amino acids. Required keys are;
>
>                  N_term   C_term   K   R   H   D   E   C   Y
> -pKset A valid string ( 'DTASelect' or 'EMBOSS' ) that will specify an
> internal set of pK values to be used. The default is 'EMBOSS'
> -seq A Bio::Seq sequence to analyze
> -places The number of decimal places to use in the isoelectric point
> calculation. The default is 2.
> METHODS
> iep $calc->iep
> Calculates the isoelectric point of the given protein. The value is
> returned.
> charge_at_pH $calc->charge_at_pH(7)
> Calculates the charge on the protein at a given pH.
> seq $seq = $calc->seq( $seq )
> Sets or returns the Bio::Seq used in the calculation.
> pKSet $pkSet = $calc->pKSet( \%pKSet );
> Sets or returns the hash of pK values used in the calculation.
> SEE ALSO
> <http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf>
> <http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/iep.html>
> <http://us.expasy.org/tools/pi_tool.html>
> BUGS
> Please report them!
> LIMITATIONS
> There are various sources for the pK values of the amino acids. The set of
> pK values chosen will affect the pI reported.
> The charge state of each residue is assumed to be independent of the others.
> Protein modifications (such as a phosphate group) that have a charge are
> ignored
> AUTHOR
> Mark Southern (mark_southern@merck.com <mailto:mark_southern@merck.com>)
> >From an algorithm by David Tabb found at
> <http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf>
> COPYRIGHT
> Copyright (c) 2002, Merck & Co. Inc. All Rights Reserved. This module is
> free software. It may be used, redistributed and/or modified under the terms
> of the Perl Artistic License (see
> <http://www.perl.com/perl/misc/Artistic.html)>
>
>
> ------------------------------------------------------------------------------
> Notice:  This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that may be confidential, proprietary copyrighted and/or legally privileged, and is intended solely for the use of the individual or entity named in this message.  If you are not the intended recipient, and have received this message in error, please immediately return this by e-mail and then delete it.
>
> ==============================================================================
>

-- 
 Aaron J Mackey
 Pearson Laboratory
 University of Virginia
 (434) 924-2821
 amackey@virginia.edu