[BioPython] kcluster and distances
Michiel Jan Laurens de Hoon
mdehoon at ims.u-tokyo.ac.jp
Sat Mar 5 00:05:49 EST 2005
Scott Rifkin wrote:
> The euclidean distance function in cluster.c is:
>
> { double result = 0.;
> ...
> result /= tweight;
> result *= n;
> return result;
> }
>
> why at the end is the result multiplied by n?
Typically, all the weights are one. Then tweight is equal to n, and the
result is equal to the usual squared Euclidean distance (the sum of the
squared differences).
In the latest version of the C Clustering Library (which is not yet
uploaded to the Biopython CVS), I removed the multiplication by n. The
euclid function then returns the mean square distance, which may be
easier to interpret.
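
To make the effect of the two normalizations concrete, here is a minimal sketch. It is not the code from cluster.c: the function names and signatures are made up for illustration, and it assumes that the elided part of the quoted function accumulates weight[i] times the squared difference, with tweight as the sum of the weights.

```c
#include <stddef.h>

/* Hypothetical helper, not the cluster.c implementation: weighted sum of
 * squared differences divided by the total weight (assumed to be what
 * "result /= tweight" produces in the quoted function). */
double euclid_mean_square(size_t n, const double x[], const double y[],
                          const double weight[])
{
    double result = 0.0;
    double tweight = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double d = x[i] - y[i];
        result += weight[i] * d * d;
        tweight += weight[i];
    }
    if (tweight == 0.0)
        return 0.0; /* assumption for this sketch: no weight means distance 0 */
    return result / tweight;
}

/* Behaviour of the quoted version: rescale the weighted mean by n, so that
 * unit weights give the plain sum of squared differences. */
double euclid_scaled(size_t n, const double x[], const double y[],
                     const double weight[])
{
    return euclid_mean_square(n, x, y, weight) * (double)n;
}
```

With x = {1, 2, 3}, y = {2, 4, 6} and unit weights, euclid_scaled gives 14 (up to rounding), which is 1 + 4 + 9, and euclid_mean_square gives 14/3. Multiplying every weight by the same constant changes neither value, because the constant cancels in the division by tweight.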
> and why isn't the square root of result given as the distance?
Taking the square root adds another calculation step, but won't affect
the (hierarchical) clustering result. So we may as well leave it out. If
desired, users can take the square root of the node distances after the
clustering calculation has finished.
Within the context of k-means clustering, not taking the square root is
actually the right thing to do, as we want to minimize the sum of squared
distances.
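
As a small illustration of why the squared distance fits k-means, the sketch below (one-dimensional, with hypothetical function names that are not part of the C Clustering Library) computes the sum of squared distances from a set of points to a candidate center. The arithmetic mean, which k-means uses as the cluster center, gives the smallest such sum.

```c
#include <stddef.h>

/* Sum of squared distances from each 1-D point to a candidate center.
 * Hypothetical helper for illustration only. */
double sum_squared_distances(size_t n, const double points[], double center)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double d = points[i] - center;
        total += d * d;
    }
    return total;
}

/* Arithmetic mean of the points, i.e. the k-means centroid in one
 * dimension. Returns 0 for an empty set (an assumption of this sketch). */
double centroid_1d(size_t n, const double points[])
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += points[i];
    return n > 0 ? sum / (double)n : 0.0;
}
```

For the points {1, 2, 6} the centroid is 3 and the sum of squared distances to it is 4 + 1 + 9 = 14. Using the median 2 as the center instead gives 1 + 0 + 16 = 17, which is larger.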
--Michiel.