[BioPython] Spatial clustering

Tue Oct 14 11:16:00 EDT 2003

Dear all,

Thanks for all the input!
I am new to this field and come from a biology background, so I am not 
very familiar with computer science. The project, however, has been running 
for a year and has shown great results for some enzymes we tested:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=14499612&dopt=Abstract

The basic idea is to use organic solvents as "probes" and use an energy 
function to find the favorable minima. We first use a simplex method 
with van der Waals cancellation and then do further minimization 
using CHARMm. Through some testing we found that 6660 probe positions 
give the best results. By clustering those molecules and calculating 
the average free energy of each cluster, we come up with the top 5 
energetically favorable clusters. It was shown that the "consensus" site 
of the clusters of different probes is the binding site of the protein.
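
A minimal sketch of the ranking step described above, assuming each cluster is just a list of member free energies and that lower (more negative) values are more favorable. The function name, cluster labels, and energies are made up for illustration and are not taken from the existing C code.

```python
from statistics import mean

def top_clusters(clusters, n=5):
    """Rank clusters by mean free energy (lowest first) and keep the best n.

    `clusters` maps a cluster label to the list of free energies of its
    members. Every cluster is assumed to have at least one member.
    """
    ranked = sorted(clusters.items(), key=lambda item: mean(item[1]))
    return [(label, mean(energies)) for label, energies in ranked[:n]]

# Hypothetical energies, for illustration only.
example = {
    "A": [-5.1, -4.9],
    "B": [-2.0, -2.4, -2.2],
    "C": [-7.0],
}
best = top_clusters(example, n=2)
print([label for label, _ in best])
```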

Actually, the clustering code already exists and is written in C. The 
person who wrote both the mapping program and the clustering program has 
already left this lab. Originally I was working on the consensus-site 
finding part, which in the past was done by manual inspection in RasMol 
or PyMOL, but I later thought it might be more efficient to wrap these 
two parts together. To me, creating a valid RMSD matrix seems to be 
as important as the clustering algorithm. For instance, the small 
molecules we use range from methanol to t-butanol, and for the latter, 
two reference points might be needed. Finding the consensus site might 
be more problematic, since you are then dealing with different kinds of 
molecules. Any comments here?
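
To make the RMSD matrix concrete, here is a rough sketch, assuming every pose of a given probe is a list of (x, y, z) atom coordinates in the same atom order and already in the protein's coordinate frame (so no superposition is done). The function names are hypothetical. It does not handle symmetric probes whose equivalent atoms can be swapped, which is where extra reference points or an atom-matching step would come in.

```python
import math

def rmsd(pose_a, pose_b):
    """RMSD between two poses of the same probe, without superposition.

    Each pose is a list of (x, y, z) tuples with atoms in the same order.
    """
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same atom count")
    total = 0.0
    for atom_a, atom_b in zip(pose_a, pose_b):
        total += sum((a - b) ** 2 for a, b in zip(atom_a, atom_b))
    return math.sqrt(total / len(pose_a))

def rmsd_matrix(poses):
    """Full symmetric matrix of pairwise RMSDs, computing each pair once."""
    n = len(poses)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(poses[i], poses[j])
    return matrix

# Two made-up two-atom poses, the second shifted by 3 along z.
pose_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_2 = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(rmsd_matrix([pose_1, pose_2]))
```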

Clustering seems to be an important issue in molecular modelling. 
People working on protein-protein docking in this lab have all put 
some effort into it, though no collaboration or uniform method 
has been developed yet.

I have a naive question about arrays/matrices. Pairwise RMSD has no 
direction, i.e. RMSD(1,2) == RMSD(2,1).
Therefore, the distance matrix would look like this:

      1     2     3     4     5
1     X    .2    .1   1.2   3.4
2    .2     X    .5    .2    .4
3    .1    .5     X    .6    .7
4   1.2    .2    .6     X    .2
5   3.4    .4    .7    .2     X

I've read the Numarray tutorial and there seem to be no special functions 
for matrices that are symmetric about the diagonal. Are there any more 
efficient approaches?
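
One common way to exploit the symmetry is to store only the upper triangle in a flat list of n*(n-1)/2 values and map (i, j) to a position in it. The sketch below is plain Python, uses 0-based indices, and assumes the diagonal is always zero. The class name and methods are made up for illustration and are not part of Numarray.

```python
class SymmetricDistances:
    """Symmetric n x n distance matrix that stores only the upper triangle."""

    def __init__(self, n):
        self.n = n
        self.values = [0.0] * (n * (n - 1) // 2)

    def _index(self, i, j):
        if not (0 <= i < self.n and 0 <= j < self.n) or i == j:
            raise IndexError("need two different indices in range(n)")
        if i > j:
            i, j = j, i  # (i, j) and (j, i) share one stored value
        # Rows 0..i-1 hold i*(2n - i - 1)/2 entries in total.
        return i * (2 * self.n - i - 1) // 2 + (j - i - 1)

    def get(self, i, j):
        return 0.0 if i == j else self.values[self._index(i, j)]

    def set(self, i, j, value):
        self.values[self._index(i, j)] = value

# A few entries from the 5 x 5 example matrix, with 0-based indices.
d = SymmetricDistances(5)
d.set(0, 1, 0.2)
d.set(3, 0, 1.2)
d.set(4, 3, 0.2)
print(d.get(1, 0), d.get(0, 3), len(d.values))
```

For 6660 poses this halves the memory of the full matrix, at the cost of an index calculation on every lookup.
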
The algorithm I have in mind is: starting with the RMSD matrix, first 
find the element with the most neighbors, make it the hub of a cluster, 
and remove it along with its members; then do the same thing recursively.
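
The hub-picking idea above could be sketched roughly as follows. It uses a loop instead of recursion, and it assumes that "neighbor" means an RMSD at or below some cutoff. The cutoff of 0.3 and the tie-breaking rule (lowest index wins) are arbitrary choices for the example, and the function name is made up.

```python
def greedy_hub_clusters(dist, cutoff):
    """Greedy clustering on a full symmetric distance matrix.

    Repeatedly pick the remaining element with the most remaining
    neighbors within `cutoff`, make it the hub of a new cluster, and
    remove it together with those neighbors.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        neighbors = {
            i: [j for j in remaining if j != i and dist[i][j] <= cutoff]
            for i in remaining
        }
        # Ties are broken in favor of the lowest index.
        hub = max(sorted(remaining), key=lambda i: len(neighbors[i]))
        members = [hub] + sorted(neighbors[hub])
        clusters.append(members)
        remaining -= set(members)
    return clusters

# The 5 x 5 example matrix, with 0 on the diagonal and 0-based indices.
example_dist = [
    [0.0, 0.2, 0.1, 1.2, 3.4],
    [0.2, 0.0, 0.5, 0.2, 0.4],
    [0.1, 0.5, 0.0, 0.6, 0.7],
    [1.2, 0.2, 0.6, 0.0, 0.2],
    [3.4, 0.4, 0.7, 0.2, 0.0],
]
print(greedy_hub_clusters(example_dist, cutoff=0.3))
```

Each cluster lists its hub first. Note that members only have to be close to the hub, not to each other, so two members of one cluster can be further apart than the cutoff.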

Dear Iddo,

I just had a look at CLUTO and will try to find out whether it suits my 
purpose. Thanks!

Dear Andrew,

I am not familiar with fingerprints or shape fitting. Can you point me 
to a place to start? I will search Google as well. I am not familiar 
with pharmacophores either and will look into them too.

Dear Michiel,
I've read the PyCluster documentation and it seems I had missed the 
point that treecluster lets me specify the distance matrix 
myself. That might be the easiest solution. Thanks!

-shuhsien