[Bioperl-l] Protein Isoelectric Point calculation

Southern, Mark R mark_southern@merck.com
Fri, 06 Sep 2002 09:43:14 -0400


This message is in MIME format. Since your mail reader does not understand
this format, some or all of this message may not be legible.

------_=_NextPart_000_01C255AB.59771460
Content-Type: text/plain; 
 charset=iso-8859-1
Content-Transfer-Encoding: 7bit

See POD documentation below ... 
This is something I discussed with Jason some time back but has taken a
while to go through disclosure. I had a need for a protein isoelectric point
calculation. There are programs out there but I wanted to understand how it
worked and came across a very decent description of an algorithm (see
acknowledgments) . Depending on the pK values chosen it compares well to
other options.
I thought that this code, in modified form might fit into the
Bio::Tools::SeqStats area.
Best wishes,
Mark.
Mark Southern
Merck & Co. Inc.
Dept. of Bioinformatics
RY80-A1
PO Box 2000
Rahway NJ 07065
* (732) 594-3777
* (732) 594-2929
 <<pICalculator.pm>> 
NAME
pICalculator
DESCRIPTION
Calculates the isoelectric point of a protein, the pH at which there is no
overall charge on the protein.
Calculates the charge on a protein at a given pH.
SYNOPSIS
use pICalculator;
use Bio::SeqIO;
my $in = Bio::SeqIO->new( -fh => \*STDIN ,-format => 'Fasta' );
my $calc = pICalculator->new(-places => 2);
while ( my $seq = $in->next_seq ) {
    $calc->seq( $seq );

    my $iep = $calc->iep;

    print sprintf( "%s\t%s\t%.2f\n"
                  ,$seq->id
                  ,$iep
                  ,$calc->charge_at_pH($iep)
                );
    for( my $i = 0; $i <= 14; $i += 0.5 ){
            print sprintf( "\tpH = %.2f\tCharge = %.2f\n"
                      ,$i
                      ,$calc->charge_at_pH($i)
                     );
    }
}
CONSTRUCTOR
new
$calc = pICalculator->new( [ [ -pKset => \%pKvalues ] [ -pKset =>
'valid_string' ] ] ,-seq => $seq # Bio::Seq ,-places => 2 );
Constructs a new pICalculator. Arguments are a flattened hash. Valid keys
are;
-pKset A reference to a hash with key value pairs for the pK values of the
charged amino acids. Required keys are;

                 N_term   C_term   K   R   H   D   E   C   Y
-pKset A valid string ( 'DTASelect' or 'EMBOSS' ) that will specify an
internal set of pK values to be used. The default is 'EMBOSS'
-seq A Bio::Seq sequence to analyze
-places The number of decimal places to use in the isoelectric point
calculation. The default is 2.
METHODS
iep $calc->iep
Calculates the isoelectric point of the given protein. The value is
returned.
charge_at_pH $calc->charge_at_pH(7)
Calculates the charge on the protein at a given pH.
seq $seq = $calc->seq( $seq )
Sets or returns the Bio::Seq used in the calculation.
pKSet $pkSet = $calc->pKSet( \%pKSet );
Sets or returns the hash of pK values used in the calculation.
SEE ALSO
<http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf>
<http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/iep.html>
<http://us.expasy.org/tools/pi_tool.html>
BUGS
Please report them!
LIMITATIONS
There are various sources for the pK values of the amino acids. The set of
pK values chosen will affect the pI reported.
The charge state of each residue is assumed to be independent of the others.
Protein modifications (such as a phosphate group) that have a charge are
ignored
AUTHOR
Mark Southern (mark_southern@merck.com <mailto:mark_southern@merck.com>)
>From an algorithm by David Tabb found at
<http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf>
COPYRIGHT
Copyright (c) 2002, Merck & Co. Inc. All Rights Reserved. This module is
free software. It may be used, redistributed and/or modified under the terms
of the Perl Artistic License (see
<http://www.perl.com/perl/misc/Artistic.html)>


------------------------------------------------------------------------------
Notice:  This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that may be confidential, proprietary copyrighted and/or legally privileged, and is intended solely for the use of the individual or entity named in this message.  If you are not the intended recipient, and have received this message in error, please immediately return this by e-mail and then delete it.

==============================================================================

------_=_NextPart_000_01C255AB.59771460
Content-Type: application/octet-stream; 
 name=pICalculator.pm
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; 
 filename=pICalculator.pm

package pICalculator;=0A=
    =0A=
use Bio::Seq;=0A=
=0A=
# pK values from the DTASelect program from Scripps=0A=
# http://fields.scripps.edu/DTASelect=0A=
our $DTASelect_pK =3D { N_term   =3D>  8.0=0A=
                     ,K        =3D> 10.0 # Lys=0A=
                     ,R        =3D> 12.0 # Arg=0A=
                     ,H        =3D>  6.5 # His=0A=
                     ,D        =3D>  4.4 # Asp=0A=
                     ,E        =3D>  4.4 # Glu=0A=
                     ,C        =3D>  8.5 # Cys=0A=
                     ,Y        =3D> 10.0 # Tyr=0A=
                     ,C_term   =3D>  3.1=0A=
                    };=0A=
=0A=
# pK values from the iep program from EMBOSS=0A=
# http://www.hgmp.mrc.ac.uk/Software/EMBOSS/=0A=
our $Emboss_pK    =3D { N_term   =3D>  8.6=0A=
                     ,K        =3D> 10.8 # Lys=0A=
                     ,R        =3D> 12.5 # Arg=0A=
                     ,H        =3D>  6.5 # His=0A=
                     ,D        =3D>  3.9 # Asp=0A=
                     ,E        =3D>  4.1 # Glu=0A=
                     ,C        =3D>  8.5 # Cys=0A=
                     ,Y        =3D> 10.1 # Tyr=0A=
                     ,C_term   =3D>  3.6=0A=
                    };=0A=
=0A=
sub new{=0A=
    my( $self, %opts ) =3D @_;=0A=
    my $this =3D bless {}, ref $self || $self;=0A=
    $this->seq( $opts{-seq} ) if exists $opts{-seq};=0A=
    $this->pKset( $opts{-pKset} || 'EMBOSS' );=0A=
    exists $opts{-places} ? $this->places( $opts{-places} ) : =
$this->places(2);=0A=
    return $this;=0A=
}=0A=
=0A=
sub seq{=0A=
    my( $this, $seq ) =3D @_;=0A=
    unless( defined $seq && UNIVERSAL::isa($seq,'Bio::Seq') ){=0A=
        die $seq . " is not a valid Bio::Seq object\n";=0A=
    }=0A=
    $this->{-seq} =3D $seq;=0A=
    $this->{-count} =3D count_charged_residues( $seq );=0A=
    return $this->{-seq};=0A=
}=0A=
=0A=
sub pKset{=0A=
    my ( $this, $pKset ) =3D @_;=0A=
    if( ref $pKset eq 'HASH' ){         # user defined pK values=0A=
        $this->{-pKset} =3D $pKset;=0A=
    }elsif( $pKset =3D~ /^emboss$/i ){    # from EMBOSS's iep =
program=0A=
        $this->{-pKset} =3D $Emboss_pK;=0A=
    }elsif( $pKset =3D~ /^dtaselect$/i ){ # from DTASelect (scripps)=0A=
        $this->{-pKset} =3D $DTASelect_pK;=0A=
    }else{                              # default to EMBOSS=0A=
        $this->{-pKset} =3D $Emboss_pK;=0A=
    }=0A=
    return $this->{-pKset};=0A=
}=0A=
=0A=
sub places{=0A=
    my $this =3D shift;=0A=
    $this->{-places} =3D shift if @_;=0A=
    return $this->{-places};=0A=
}=0A=
=0A=
sub iep{=0A=
    my $this =3D shift;=0A=
    return calculate_iep( $this->{-pKset}=0A=
                         ,$this->{-places}=0A=
                         ,$this->{-seq}=0A=
                         ,$this->{-count} =0A=
                        );=0A=
}=0A=
=0A=
sub charge_at_pH{=0A=
    my $this =3D shift;=0A=
    return calculate_charge_at_pH( shift, $this->{-pKset}, =
$this->{-count} );=0A=
}=0A=
=0A=
sub count_charged_residues{=0A=
    my $seq =3D shift;=0A=
    my $sequence =3D $seq->seq;=0A=
    my $count;=0A=
    for ( qw( K R H D E C Y ) ){ # charged AA's=0A=
        $count->{$_}++ while $sequence =3D~ /$_/ig;=0A=
    }=0A=
    return $count;=0A=
}=0A=
=0A=
sub calculate_iep{=0A=
    my( $pK, $places, $seq, $count ) =3D @_;=0A=
    my $pH =3D 7.0;=0A=
    my $step =3D 3.5;=0A=
    my $last_charge =3D 0.0;=0A=
    my $format =3D "%.${places}f";=0A=
=0A=
    unless( defined $count ){=0A=
        $count =3D count_charged_residues($seq);=0A=
    }=0A=
    while(1){=0A=
        my $charge =3D calculate_charge_at_pH( $pH, $pK, $count );=0A=
        last if sprintf($format,$charge) =3D=3D =
sprintf($format,$last_charge);=0A=
        $charge > 0 ? ( $pH +=3D $step ) : ( $pH -=3D $step );=0A=
        $step /=3D 2.0; =0A=
        $last_charge =3D $charge;=0A=
    }=0A=
    return sprintf( $format, $pH );=0A=
}=0A=
=0A=
sub calculate_charge_at_pH{ =0A=
# its the sum of all the partial charges for the =0A=
# termini and all of the charged aa's!=0A=
    my( $pH, $pK, $count ) =3D @_;=0A=
    my $charge =3D               partial_charge( $pK->{N_term}, $pH )  =
# +ve=0A=
               + $count->{K} * partial_charge( $pK->{K}     , $pH )  # =
.=0A=
               + $count->{R} * partial_charge( $pK->{R}     , $pH )  # =
.=0A=
               + $count->{H} * partial_charge( $pK->{H}     , $pH )  # =
.=0A=
               - $count->{D} * partial_charge( $pH, $pK->{D}      )  # =
-ve=0A=
               - $count->{E} * partial_charge( $pH, $pK->{E}      )  # =
.=0A=
               - $count->{C} * partial_charge( $pH, $pK->{C}      )  # =
.=0A=
               - $count->{Y} * partial_charge( $pH, $pK->{Y}      )  # =
.=0A=
               -               partial_charge( $pH, $pK->{C_term} ); # =
.=0A=
    return $charge;=0A=
}=0A=
=0A=
# Concentration Ratio is 10**(pK - pH) for positive groups =0A=
# and 10**(pH - pK) for negative groups=0A=
sub partial_charge{=0A=
    my $cr =3D 10 ** ( $_[0] - $_[1] );=0A=
    return $cr / ( $cr + 1 );=0A=
}=0A=
=0A=
__END__=0A=
=0A=
=3Dhead1 NAME=0A=
=0A=
pICalculator=0A=
=0A=
=3Dhead1 DESCRIPTION=0A=
=0A=
Calculates the isoelectric point of a protein, the pH at which there is =
no =0A=
overall charge on the protein.=0A=
=0A=
Calculates the charge on a protein at a given pH.=0A=
=0A=
=3Dhead1 SYNOPSIS=0A=
=0A=
use pICalculator;=0A=
=0A=
use Bio::SeqIO;=0A=
=0A=
my $in  =3D Bio::SeqIO->new( -fh     =3D> \*STDIN=0A=
                          ,-format =3D> 'Fasta' );=0A=
=0A=
my $calc =3D pICalculator->new(-places =3D> 2);=0A=
=0A=
while ( my $seq =3D $in->next_seq ) {=0A=
=0A=
    $calc->seq( $seq );=0A=
    =0A=
    my $iep =3D $calc->iep;=0A=
    =0A=
    print sprintf( "%s\t%s\t%.2f\n"=0A=
                  ,$seq->id=0A=
                  ,$iep=0A=
                  ,$calc->charge_at_pH($iep)=0A=
                );=0A=
=0A=
    for( my $i =3D 0; $i <=3D 14; $i +=3D 0.5 ){=0A=
=0A=
 	    print sprintf( "\tpH =3D %.2f\tCharge =3D %.2f\n"=0A=
                      ,$i=0A=
                      ,$calc->charge_at_pH($i)=0A=
                     );=0A=
=0A=
    }   =0A=
=0A=
}=0A=
=0A=
=3Dhead1 CONSTRUCTOR=0A=
=0A=
B<new>=0A=
=0A=
$calc =3D pICalculator->new( [ [ -pKset =3D> \%pKvalues ] =0A=
                             [ -pKset =3D> 'valid_string' ] =0A=
                           ]=0A=
                          ,-seq    =3D> $seq # Bio::Seq=0A=
                          ,-places =3D> 2 =0A=
                         );=0A=
=0A=
Constructs a new pICalculator. Arguments are a flattened hash. Valid =
keys are;=0A=
=0A=
B<-pKset>        A reference to a hash with key value pairs for the pK =
values of=0A=
=0A=
                 the charged amino acids. Required keys are;=0A=
                 =0A=
                 N_term   C_term   K   R   H   D   E   C   Y=0A=
=0A=
B<-pKset>        A valid string ( 'DTASelect' or 'EMBOSS' ) that will =
specify an=0A=
                 =0A=
                 internal set of pK values to be used. The default is =
'EMBOSS'=0A=
=0A=
B<-seq>          A Bio::Seq sequence to analyze=0A=
=0A=
B<-places>       The number of decimal places to use in the isoelectric =
point=0A=
=0A=
                 calculation. The default is 2.=0A=
=0A=
=3Dhead1 METHODS=0A=
=0A=
B<iep>           $calc->iep=0A=
=0A=
Calculates the isoelectric point of the given protein. The value is =
returned.=0A=
=0A=
B<charge_at_pH>  $calc->charge_at_pH(7)=0A=
=0A=
Calculates the charge on the protein at a given pH.=0A=
=0A=
B<seq>           $seq =3D $calc->seq( $seq )=0A=
=0A=
Sets or returns the Bio::Seq used in the calculation.=0A=
=0A=
B<pKSet>         $pkSet =3D $calc->pKSet( \%pKSet );=0A=
=0A=
Sets or returns the hash or pK values used in the calculation.=0A=
=0A=
=3Dhead1 SEE ALSO=0A=
=0A=
http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf=0A=
=0A=
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/iep.html=0A=
=0A=
http://us.expasy.org/tools/pi_tool.html=0A=
=0A=
=3Dhead1 BUGS=0A=
=0A=
Please report them!=0A=
=0A=
=3Dhead1 LIMITATIONS=0A=
=0A=
There are various sources for the pK values of the amino acids. The set =
of pK =0A=
values chosen will affect the pI reported.=0A=
=0A=
The charge state of each residue is assumed to be independent of the =
others.=0A=
=0A=
Protein modifications (such as a phosphate group) that have a charge =
are ignored=0A=
=0A=
=3Dhead1 AUTHOR=0A=
=0A=
Mark Southern (mark_southern@merck.com)=0A=
=0A=
>From an algorithm by David Tabb found at =0A=
http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf=0A=
=0A=
=3Dhead1 COPYRIGHT=0A=
=0A=
Copyright (c) 2002, Merck & Co. Inc. All Rights Reserved.=0A=
This module is free software. It may be used, redistributed=0A=
and/or modified under the terms of the Perl Artistic License=0A=
(see http://www.perl.com/perl/misc/Artistic.html)=0A=
=0A=
=3Dcut=0A=

------_=_NextPart_000_01C255AB.59771460--