[BioPython] kcluster and distances

Michiel Jan Laurens de Hoon mdehoon at ims.u-tokyo.ac.jp
Sat Mar 5 00:05:49 EST 2005


Scott Rifkin wrote:
> The euclidean distance function in cluster.c is:
> 
> { double result = 0.;
> ...
>   result /= tweight;
>   result *= n;
>   return result;
> }
> 
> why at the end is the result multiplied by n?

Typically, all the weights are one. Then tweight is equal to n, and the 
result is equal to the usual Euclidean distance, squared (no square 
root is taken).
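
To illustrate, here is a rough Python sketch (not the C code from 
cluster.c) of a weighted distance that ends with the same two steps as 
the quoted function. The loop is elided in the quoted snippet, so the 
accumulation of weighted squared differences is an assumption, and the 
names euclid_scaled, x, y and weights are made up for this example.

```python
def euclid_scaled(x, y, weights):
    """Weighted sum of squared differences, divided by tweight, times n."""
    n = len(x)
    result = 0.0
    tweight = 0.0
    for xi, yi, w in zip(x, y, weights):
        diff = xi - yi
        result += w * diff * diff  # assumed content of the elided loop
        tweight += w
    result /= tweight              # as in the quoted snippet
    result *= n                    # as in the quoted snippet
    return result


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
unit_weights = [1.0, 1.0, 1.0, 1.0]

# With unit weights, tweight == n, so the two scaling steps cancel.
plain_sum_of_squares = sum((a - b) ** 2 for a, b in zip(x, y))
scaled = euclid_scaled(x, y, unit_weights)
print(scaled, plain_sum_of_squares)  # prints: 30.0 30.0
```

With non-unit weights the two scaling steps no longer cancel; the 
result is then n times the weighted mean of the squared differences.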

In the latest version of the C Clustering Library (which has not yet 
been uploaded to the Biopython CVS), I removed the multiplication by n. 
The euclid function then returns the mean squared distance, which may be 
easier to interpret.
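
A small Python sketch of that variant, assuming the only change is that 
the final multiplication by n is dropped. The function name euclid_mean 
and the sample data are hypothetical, and the loop body is again an 
assumption since it is not shown in the quoted code.

```python
def euclid_mean(x, y, weights):
    """Weighted mean of the squared differences (no multiplication by n)."""
    total = 0.0
    tweight = 0.0
    for xi, yi, w in zip(x, y, weights):
        total += w * (xi - yi) ** 2  # assumed loop body
        tweight += w
    return total / tweight


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
mean_sq = euclid_mean(x, y, [1.0, 1.0, 1.0, 1.0])
print(mean_sq)  # prints: 7.5
```

One possible reading of "easier to interpret" is that this value is an 
average per element, so it does not grow just because the vectors are 
longer.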

> and why isn't the square root of result given as the distance?

Taking the square root adds another calculation step, but won't affect 
the (hierarchical) clustering result. So we may as well leave it out. If 
desired, users can take the square root of the node distances after the 
clustering calculation has finished.
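
The reason is that the square root is a monotonically increasing 
function, so it does not change which pair of items is closest. The 
Python sketch below checks only this ordering property, which is what 
single- and complete-linkage merges rely on; it is not a full clustering 
run, and the points and helper names are made up for the example.

```python
import math
from itertools import combinations

# Hypothetical 2-D points, chosen so that no two pairwise distances tie.
points = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0), "d": (6.0, 9.0)}


def squared_distance(p, q):
    """Sum of squared coordinate differences (no square root)."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


pairs = list(combinations(sorted(points), 2))
squared = {pair: squared_distance(points[pair[0]], points[pair[1]])
           for pair in pairs}
# The square root can be applied afterwards, once the distances are known.
rooted = {pair: math.sqrt(d) for pair, d in squared.items()}

order_squared = sorted(pairs, key=squared.get)
order_rooted = sorted(pairs, key=rooted.get)
print(order_squared == order_rooted)  # prints: True
print(order_squared[0])               # prints: ('a', 'c')
```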

Within the context of k-means clustering, not taking the square root is 
actually the right thing, as we want to minimize the sum of square 
distances.
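
A one-dimensional Python sketch of why this matters: the cluster mean, 
which k-means uses as the cluster center, minimizes the sum of squared 
distances, but in general not the sum of plain (unsquared) distances. 
The data and the grid search over candidate centers are purely 
illustrative assumptions and are not taken from the library.

```python
def sum_squared(values, center):
    """Sum of squared distances from each value to the center."""
    return sum((v - center) ** 2 for v in values)


def sum_absolute(values, center):
    """Sum of plain (unsquared) distances from each value to the center."""
    return sum(abs(v - center) for v in values)


cluster = [0.0, 0.0, 9.0]
mean = sum(cluster) / len(cluster)  # 3.0

# Try candidate centers 0.0, 0.1, ..., 9.0 and keep the best for each criterion.
candidates = [i / 10 for i in range(0, 91)]
best_for_squares = min(candidates, key=lambda c: sum_squared(cluster, c))
best_for_absolute = min(candidates, key=lambda c: sum_absolute(cluster, c))

print(best_for_squares)   # prints: 3.0 (the mean)
print(best_for_absolute)  # prints: 0.0 (the median, not the mean)
```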

--Michiel.

