[BioPython] Protparam using BioPython

Fri Apr 27 09:55:42 UTC 2007

Shameer Khadar wrote:
> Dear Peter,
> 
> Thanks for your reply.

Sorry for the delay - I was away on a course this week.

> I was looking for a script based on Bio.SeqUtils.
> I got the following script from a website, its working perfect for me. But
> the problem is i have around 1000 sequence (in raw format without headers)
> and i thought to process it using a foreach equivalent in python(I am a
> python newbie). But its only a couple of minutes back i came to know that
> there is no foreach in python, but some better alternative is available
> !!!.

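There is a "for each" equivalent in Python: the ordinary for statement 
loops directly over the items of a list, the lines of a file, and so on. 
http://docs.python.org/tut/node6.html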

If you don't have a good introductory Python book, that online tutorial 
is an excellent starting point.
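
As a minimal sketch of that "for each" style, here is a for loop visiting 
each item of a list in turn. The list and the describe function are made 
up purely for illustration:

```python
# Illustrative only: a hypothetical list holding a few of the sequences.
sequences = ["AEGEFAHLYGTFRED", "AEGEFAHLZGTFRED", "AEGEFGATYGVYTSD"]


def describe(seqs):
    """Return one summary line per sequence."""
    summaries = []
    for seq in seqs:  # reads as "for each seq in seqs"
        summaries.append("%s has %d residues" % (seq, len(seq)))
    return summaries


for summary in describe(sequences):
    print(summary)
```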

> It will be great if you can help to process my file using this
> program.
> 
> program :
> from Bio.SeqUtils import ProtParam, ProtParamData
> def PrintDictionary(MyDict):
>         for i in MyDict.keys():
>                 print "%s\t%.2f" %(i, MyDict[i])
>         print "MAEGEITTFTALTEKFNLPPGNYKKPKLLYCSNGGHFL"
> X = ProtParam.ProteinAnalysis("")
> print "Instability index of test protein: %.2f" % X.instability_index()

It seems like you have only given bits of a program (for example, 
ProteinAnalysis is called with an empty string), so I have tried to 
guess what you meant.

> first few lines of my file :
> AEGEFAHLYGTFRED
> AEGEFAHLZGTFRED
> AEGEFGATYGVYTSD
> AEGEFGATZGVYTSD
> AEGEFGATYGVZTSD
> AEGEFGATZGVZTSD
> AEGEFLYGEIQGTQD

In the following examples, I am assuming your sequences are in a plain 
text file called protparam.txt, with one sequence per line.

Try something like this first of all, and make sure that it prints out 
your sequences correctly:

for line in open("protparam.txt"):
    # Remove any trailing new lines or white space
    seq_string = line.rstrip()
    print "Sequence <%s>" % seq_string
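
If the file might also contain blank lines (an assumption on my part, as 
I have only seen the first few lines), a small helper could skip them. 
This is just a sketch; clean_lines is a name I have made up, and a list 
of strings stands in for the file so the example runs on its own:

```python
def clean_lines(lines):
    """Yield each non-empty line with surrounding white space removed."""
    for line in lines:
        seq_string = line.strip()
        if seq_string:  # skip lines that are empty after stripping
            yield seq_string


# A list of strings stands in for the file here (hypothetical data).
example_lines = ["AEGEFAHLYGTFRED\n", "\n", "AEGEFAHLZGTFRED  \n"]
for seq_string in clean_lines(example_lines):
    print("Sequence <%s>" % seq_string)
```

Passing open("protparam.txt") instead of the list should work the same 
way, because a file object also hands back its lines one at a time.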

Then try doing the ProtParam.ProteinAnalysis of each sequence string:

from Bio.SeqUtils import ProtParam

for line in open("protparam.txt"):
    # Remove any trailing new lines or white space
    seq_string = line.rstrip()
    print "Sequence <%s>" % seq_string
    X = ProtParam.ProteinAnalysis(seq_string)
    print "Instability index: %.2f" % X.instability_index()

You'll find that instability_index() raises a KeyError for the "Z" 
(presumably this is Glx - glutamic acid or glutamine? i.e. E or Q) 
present in many of your sequences, so this next version catches that 
error, reports it, and then carries on to the next sequence:

from Bio.SeqUtils import ProtParam

for line in open("protparam.txt"):
    # Remove any trailing new lines or white space
    seq_string = line.rstrip()

    print  # blank line
    print "Sequence <%s>" % seq_string
    X = ProtParam.ProteinAnalysis(seq_string)
    try:
        print "Instability index: %.2f" % X.instability_index()
    except KeyError, e:
        # The missing key is the letter the calculation has no value for
        print "Problem with the letter %s in the sequence?" % str(e)

The output is:

Sequence <AEGEFAHLYGTFRED>
Instability index: 8.39

Sequence <AEGEFAHLZGTFRED>
Problem with the letter 'Z' in the sequence?

Sequence <AEGEFGATYGVYTSD>
Instability index: -17.70

Sequence <AEGEFGATZGVYTSD>
Problem with the letter 'Z' in the sequence?

Sequence <AEGEFGATYGVZTSD>
Problem with the letter 'Z' in the sequence?

Sequence <AEGEFGATZGVZTSD>
Problem with the letter 'Z' in the sequence?

Sequence <AEGEFLYGEIQGTQD>
Instability index: 8.61
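
To spot the awkward letters before running the analysis, rather than 
waiting for the KeyError, one could compare each sequence against the 20 
standard amino acid letters. This is only a sketch: unsupported_letters 
is my own name, and I am assuming the instability index has values only 
for those 20 letters.

```python
# The 20 standard amino acids, as one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def unsupported_letters(seq):
    """Return a sorted list of letters in seq outside the standard 20."""
    return sorted(set(seq) - STANDARD_AMINO_ACIDS)


for seq in ["AEGEFAHLYGTFRED", "AEGEFAHLZGTFRED"]:
    bad = unsupported_letters(seq)
    if bad:
        print("Sequence <%s> contains %s" % (seq, ", ".join(bad)))
    else:
        print("Sequence <%s> uses only standard letters" % seq)
```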

You'll have to check for yourself whether these instability index values 
are sensible.  I don't know what to suggest for your "Z" entries - the 
instability index will be different depending on whether you use E or Q 
in place of each Z.
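
If you did want to try both possibilities, one option is to generate 
every E/Q variant of a sequence and analyse each in turn. This is a 
rough sketch which assumes Z really does mean Glx; the function name 
expand_glx is made up for this example:

```python
from itertools import product

# Assumption: Z (Glx) stands for glutamic acid (E) or glutamine (Q).
GLX_OPTIONS = "EQ"


def expand_glx(seq):
    """Return every sequence obtained by replacing each Z with E or Q."""
    choices = [GLX_OPTIONS if letter == "Z" else letter for letter in seq]
    # product() picks one letter from each position's set of choices.
    return ["".join(combo) for combo in product(*choices)]


for variant in expand_glx("AEGEFGATZGVZTSD"):
    print(variant)
```

Each variant could then be passed to ProtParam.ProteinAnalysis to see 
how much the choice of E or Q changes the result. Note that a sequence 
with n Z letters gives 2**n variants.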

Peter