[Biopython-dev] Bio.Cluster
Iddo Friedberg
idoerg at burnham.org
Fri Jul 18 12:23:08 EDT 2003
Thanks! That looks extremely useful.
One comment (just from reading your email, I haven't looked at this
yet): if the data in the matrix is very sparse, then a NumPy array would
seem redundant in the sense that most of it will be zeros, and the user
will be trying to pass a huge data structure, in which most of the data
is superfluous. Or am I getting something wrong here?
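
To make the concern concrete, here is a rough sketch in plain Python (no
NumPy or Bio.Cluster calls). It assumes the known fragment-pair distances
are kept in a dict keyed by index pairs, expands them into the dense,
symmetric, zero-diagonal form described in the quoted message below, and
counts how many entries carry information. The names dense_from_pairs and
is_valid_distance_matrix, and the toy data, are made up for illustration;
they are not part of Bio.Cluster.

```python
def dense_from_pairs(n, pair_distances, fill=0.0):
    """Expand a sparse {(i, j): distance} mapping into a dense n x n matrix.

    Hypothetical helper: pairs missing from the mapping get `fill`, which
    is how a sparse data set ends up as a dense array of mostly zeros.
    """
    matrix = [[fill] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 0.0  # zero distance from each item to itself
    for (i, j), distance in pair_distances.items():
        matrix[i][j] = distance
        matrix[j][i] = distance  # mirror the entry to keep the matrix symmetric
    return matrix


def is_valid_distance_matrix(matrix, tol=1e-12):
    """Return True if the matrix is square, symmetric and zero on the diagonal."""
    n = len(matrix)
    for i in range(n):
        if len(matrix[i]) != n or abs(matrix[i][i]) > tol:
            return False
        for j in range(i):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                return False
    return True


# Toy data: 6 fragments, only 3 pairwise distances known.
n_items = 6
pairs = {(0, 1): 0.8, (2, 5): 1.3, (3, 4): 0.4}

dense = dense_from_pairs(n_items, pairs)
stored_entries = len(pairs)
dense_entries = n_items * n_items

print("informative entries:", stored_entries, "of", dense_entries)
print("valid distance matrix:", is_valid_distance_matrix(dense))
```

Since the quoted message says the code does not check for symmetry or a
zero diagonal, a check like the second function could be run before
passing the matrix on.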
Thanks!
Iddo
Michiel Jan Laurens de Hoon wrote:
> I have added an option to do hierarchical clustering based on the
> distance matrix directly. The new version is in Biopython's CVS. To
> apply hierarchical clustering to the gene expression data, use
>
> treecluster(my_matrix, ...)
>
> or
>
> treecluster(data=my_matrix, ...)
>
> To do hierarchical clustering on the distance matrix directly, use
>
> treecluster(distancematrix=my_distance_matrix, ...)
>
> where my_distance_matrix is a 2D NumPy array which is symmetric and has
> zeros on the diagonal (though the code does not check for it). This
> works for pairwise single-, maximum-, and average-linkage, but not for
> pairwise centroid-linkage, for which you would need the original gene
> expression data.
>
> I had to make some modifications in the Python <-> C interface for this,
> which tends to be error prone. If you find any bugs, please let me know.
>
> --Michiel.
>
> Iddo Friedberg wrote:
>
>> Dear Michiel,
>>
>> I just looked at the manual for Bio.Cluster (very well written, BTW).
>> Is there a way to do a k-means clustering (or other) based on a
>> distance matrix, rather than on the gene expression vector data? The
>> data I am trying to cluster are the structural similarities of protein
>> structure fragments, and as such already appear in matrix form.
>>
>> Thanks,
>>
>> ./I
>>
>>
>>
>> Michiel Jan Laurens de Hoon wrote:
>>
>>> Dear biopython developers,
>>>
>>> I have added Bio.Cluster to the Biopython CVS. Bio.Cluster contains
>>> clustering techniques for gene expression data (hierarchical,
>>> k-means, and SOMs); most routines are written in C with a Python
>>> wrapper. This package also exists separately as Pycluster.
>>>
>>> The Python and C source code is in Bio/Cluster; I have also added
>>> Bio.Cluster to setup.py.
>>>
>>> In case you want to try this package, there is a manual at
>>> http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf
>>> (replace "from Pycluster import *" by "from Bio.Cluster import *")
>>> and a sample data set at
>>> http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/demo.txt.
>>> Please let me know if you find any problems with this package.
>>>
>>> --Michiel.
>>>
>>
>
--
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 646 3171
http://ffas.ljcrf.edu/~iddo