[BioPython] Bio.kNN documentation

Peter biopython at maubp.freeserve.co.uk
Wed Oct 1 16:17:10 UTC 2008


Bruce wrote:
> I tested this [Bio.kNN] on a bioinformatics data set with almost 1500
> data points, 8 explanatory variables and k=9. ...

Do you think this larger example could be adapted into something for
the Biopython documentation?  Either way, the next bit of code looks
interesting.

> I did not see any examples for k-nearest neighbor so below is (very bad)
> code using the logistic regression example
> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).

This is a set of Bacillus subtilis gene pairs for which the operon
structure is known, with the intergene distance and gene expression
score as the explanatory variables, and the class being same operon or
different operons.

> from Bio import kNN
> xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
>       [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
>       [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
>       [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
>       [93, -291.13]]
> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
> model = kNN.train(xs, ys, 3)
> ccr=0
> tobs=0
> for px, py in zip(xs, ys):
>       cp=kNN.classify(model, px)
>       tobs +=1
>       if cp==py:
>               ccr +=1
> print tobs, ccr

Could you expand on the cryptic variable names?  ccr = correct call
rate? tobs = total observations?
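
If so, something along these lines (untested, and still assuming the
xs, ys and model variables from the code above) with more descriptive
names might read better in a cookbook example:

from Bio import kNN

total_observations = 0
correct_calls = 0
for x, true_class in zip(xs, ys):
    # Classify each training point with the fitted model
    # (resubstitution on the training data, so this will be optimistic)
    predicted_class = kNN.classify(model, x)
    total_observations += 1
    if predicted_class == true_class:
        correct_calls += 1
print("%i of %i calls correct" % (correct_calls, total_observations))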

Coupled with a scatter plot (say with pylab, showing the two classes
in different colours), this could be turned into a nice little example
for the cookbook section of the tutorial.  Notice that later on in the
logistic regression example there is a second table of "test data"
which could be used to make de novo predictions.
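
For example, something like this (untested, reusing the xs, ys and
model above, and assuming matplotlib/pylab is installed) would do for
the plot:

import pylab

# Separate the coordinates by class so each gets its own colour
same = [x for x, y in zip(xs, ys) if y == 1]        # same operon
different = [x for x, y in zip(xs, ys) if y == 0]   # different operons
pylab.scatter([x[0] for x in same], [x[1] for x in same],
              c="red", label="Same operon")
pylab.scatter([x[0] for x in different], [x[1] for x in different],
              c="blue", label="Different operons")
pylab.xlabel("Intergene distance")
pylab.ylabel("Gene expression score")
pylab.legend()
pylab.show()

and then for the de novo predictions, once that test data table is
copied in, just call kNN.classify on each new pair, e.g. (numbers made
up purely for illustration):

new_pair = [6, -173.14]  # hypothetical [intergene distance, expression score]
print("Predicted class: %i" % kNN.classify(model, new_pair))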

Thanks,

Peter


