[BioPython] Bio.kNN documentation

Bruce Southey bsouthey at gmail.com
Wed Oct 1 18:40:41 UTC 2008


Peter wrote:
> Bruce wrote:
>   
>> I tested this [Bio.kNN] on a bioinformatics data set with almost 1500
>> data points, 8 explanatory variables and k=9. ...
>>     
>
> Do you think this larger example could be adapted into something for
> the Biopython documentation?  Otherwise the next bit of code looks
> interesting.
>
>   
>> I did not see any examples for k-nearest neighbor, so below is (very bad)
>> code using the logistic regression example
>> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).
>>     
>
> This is a set of Bacillus subtilis gene pairs for which the operon
> structure is known, with the intergene distance and gene expression
> score as explanatory variables, with the class being same operon or
> different operons.
>
>   
>> from Bio import kNN
>> xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11,
>> -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58,
>> -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99],
>> [154, -213.83], [147, -380.85], [93, -291.13]]
>> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
>> model = kNN.train(xs, ys, 3)
>> ccr = 0
>> tobs = 0
>> for px, py in zip(xs, ys):
>>     cp = kNN.classify(model, px)
>>     tobs += 1
>>     if cp == py:
>>         ccr += 1
>> print(tobs, ccr)
>>     
>
> Could you expand on the cryptic variable names?  ccr = correct call
> rate? tobs = total observations?
>
> Coupled with a scatter plot (say with pylab, showing the two classes
> in different colours), this could be turned into a nice little example
> for the cookbook section of the tutorial.  Notice that later on in the
> logistic regression example there is a second table of "test data"
> which could be used to make de novo predictions.
>
> Thanks,
>
> Peter
>
>   
I did realize that this was coming... :-)
(I guess I am volunteering myself to provide some material on machine 
learning with BioPython. So this is a start.)

I wanted something quick and dirty to output for testing, so tobs is the 
total number of observations and ccr is the number of correctly classified 
points. I was too lazy to divide ccr by tobs to get the correct 
classification rate.

Here is a more extended code sample that also uses logistic regression. 
(Python is so great to work with here!) I don't have any plotting packages 
installed, but someone could add the plots.

Regards
Bruce






-------------- next part --------------
A non-text attachment was scrubbed...
Name: knn_lr_example.py
Type: text/x-python
Size: 3257 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20081001/40109831/attachment-0002.py>

