[BioPython] Bio.distance

Wed Oct 1 15:49:53 UTC 2008

Michiel de Hoon wrote:
> Hi everybody,
>
> Since the 1.48 release, Biopython has been making good progress in the migration from Numerical Python to NumPy. As part of this process, we are now reviewing and consolidating the code in Biopython that makes use of Numerical Python / NumPy. Specifically, we are thinking of merging the code in Bio.distance into Bio.kNN, and deprecating Bio.distance and Bio.cdistance. Since Bio.kNN is the only module in Biopython that makes use of Bio.distance, we think that this won't affect anybody. However, if you are using Bio.distance outside of Bio.kNN, please let us know so we can find an alternative solution.
>
> --Michiel.
>
Hi,
With a 'standard' install, I do not think there is any advantage in
using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics
data set with almost 1500 data points, 8 explanatory variables and k=9.
On my system I got only about a one-second difference between using
Bio.cdistance and commenting it out (after removing the build directory
and reinstalling everything). The maximum times across three runs were
under 16.6 seconds with it and under 17.4 seconds without it.
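
For anyone who wants to repeat this kind of comparison, here is a minimal
sketch of taking the maximum wall-clock time over three runs with the
standard timeit module. The names max_time and workload are hypothetical;
workload is only a stand-in for the real job (for example, classifying
every data point with Bio.kNN) and is not the code that produced the
numbers above.

```python
import timeit


def max_time(func, runs=3):
    """Return the slowest wall-clock time (seconds) over `runs` single calls."""
    # repeat=runs, number=1: call func once per run and time each run.
    times = timeit.repeat(func, repeat=runs, number=1)
    return max(times)


def workload():
    # Hypothetical placeholder for the real job being timed.
    return sum(i * i for i in range(10000))


slowest = max_time(workload)
print("max over 3 runs: %.4f s" % slowest)
```

Running the same function once with Bio.cdistance available and once
without it would give two figures comparable to the ones quoted above.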

My system runs Linux x86_64 (Fedora 10), but it is not a 'clean' system
because other CPU-intensive processes were running. I used Python 2.5
and Numeric 2.4, as I forgot the order of imports. In my version the
default distance without Bio.cdistance uses the Numeric dot function (I
did not try the Python version), so I would expect it to be noticeably
faster when LAPACK or ATLAS is installed than when they are not. (I used
the Fedora-supplied Numeric, so while I think this timing is without
LAPACK and ATLAS, I am not completely sure of that.)
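
To make the dot-based distance concrete, here is a minimal sketch of a
Euclidean distance computed as the square root of the dot product of the
difference vector with itself. It is written in plain Python so that it
runs without Numeric or NumPy; the helper names dot and euclidean are
illustrative and this is not the code in Bio.kNN or Bio.distance. With an
array library, the dot() helper would be replaced by the library's own dot
function, which is where an optimized BLAS could make a difference.

```python
import math


def dot(u, v):
    """Plain-Python dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def euclidean(x, y):
    """Euclidean distance computed as sqrt(d . d), with d = x - y."""
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(dot(d, d))


print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```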

I did not see any examples for k-nearest neighbors, so below is (very
bad) code based on the logistic regression example
(http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).

Regards
Bruce

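from Bio import kNN

# Training data: two explanatory variables per observation
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]
# Class label (0 or 1) for each observation
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Train a k-nearest neighbors model with k=3
model = kNN.train(xs, ys, 3)

# Classify each training point and count the correct predictions
ccr = 0   # correctly classified observations
tobs = 0  # total observations
for px, py in zip(xs, ys):
    cp = kNN.classify(model, px)
    tobs += 1
    if cp == py:
        ccr += 1
print("%d %d" % (tobs, ccr))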
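
For readers without Biopython at hand, the following is a minimal
pure-Python sketch of what a k-nearest neighbors classifier does: find the
k closest training points and take a majority vote over their labels. The
function knn_classify and the toy data are hypothetical and only
illustrate the idea; Bio.kNN's own implementation may differ (for
example, in how votes are weighted or ties are broken).

```python
from collections import Counter


def knn_classify(xs, ys, k, x):
    """Return the majority label among the k training points closest to x."""
    # Squared Euclidean distance gives the same ordering as the distance.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), i)
        for i, p in enumerate(xs)
    )
    # Count the labels of the k nearest points and pick the most common.
    votes = Counter(ys[i] for _, i in dists[:k])
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated groups.
toy_xs = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]]
toy_ys = [1, 1, 1, 0, 0, 0]

print(knn_classify(toy_xs, toy_ys, 3, [0.2, 0.1]))
print(knn_classify(toy_xs, toy_ys, 3, [5.5, 5.5]))
```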