[BioPython] Bio.Cluster - Howto, Documentation, exporting results

Wed Mar 26 04:19:13 UTC 2008

> Additionally, as of BioPython-1.44, there are a couple of things 
> mentioned in the documentation that are not available in Bio.Cluster.
> One of those is the Bio.Cluster.read function. I don't know if this is 
> because it was not yet in BioPython-1.44 or if the documentation is 
> outdated.

Some changes were made to Bio.Cluster in Biopython 1.45. These are largely cosmetic, intended to make Bio.Cluster more consistent with other modules in Biopython. One of them is the read() function, which was added in Biopython 1.45 and is therefore not available in 1.44. I have now updated the documentation for Bio.Cluster on the Biopython website; it corresponds to Biopython 1.45.

> I don't read the data from files, so I don't understand if DataFile 
> class is what I need, and if it is, how do I  make use of it.

> What I'm trying to do is to calculate the distances between some 
> multidimensional vectors and then cluster them. I managed to do that, 
> but then I don't know what to do with the Tree object I get. It's also 
> not obvious how do I keep track of which values in the Tree object 
> correspond to which entries in the distance matrix or in the original data.

The values in the Tree object, if non-negative, simply correspond to row numbers in the distance matrix. If negative, they refer to a node created earlier in the tree: the first node is node -1, the second is node -2, and so on. So if the Tree object is
[1, 2]  --> This is node # -1
[-1,0]  --> This is node # -2
then row 1 and row 2 in the distance matrix are joined first, and then row 0 in the distance matrix is joined to node -1, i.e. the node [1,2].
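
To make the numbering concrete, here is a minimal sketch in plain Python that resolves each node into the row labels it contains. It represents the tree as a list of (left, right) pairs rather than an actual Tree object, and the names leaves_per_node and labels are made up for this illustration.

```python
def leaves_per_node(tree, labels):
    """Return, for each node, the labels of all rows merged under it.

    tree   -- list of (left, right) pairs; a value >= 0 is a row number,
              a value < 0 refers to an earlier node (-1 is the first node).
    labels -- one label per row of the distance matrix (illustrative names).
    """
    members = []
    for left, right in tree:
        merged = []
        for item in (left, right):
            if item >= 0:
                merged.append(labels[item])
            else:
                # Node -1 is members[0], node -2 is members[1], and so on.
                merged.extend(members[-item - 1])
        members.append(merged)
    return members


tree = [(1, 2), (-1, 0)]
labels = ["Row0", "Row1", "Row2"]
for number, merged in enumerate(leaves_per_node(tree, labels), start=1):
    print("Node -%d contains %s" % (number, ", ".join(merged)))
```

With a real Tree object, the same pairs should be available from the left and right attributes of each node, so the list of pairs can be built from those.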

> Is it possible to pass text in the original data so that it is used as 
> some sort of identifying header in later operations?

Instead of relying on the row numbers, you can also create an empty Bio.Cluster.Record object and fill it with the data you have. Bio.Cluster.Record is essentially the same as Bio.Cluster.DataFile; only the name was changed, for consistency with other Biopython modules. It may be a good idea to look at the documentation of Cluster 3 at
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/manual/index.html
to understand what all the fields in Bio.Cluster.Record are.

Another way is to construct a file in memory and let Bio.Cluster.read parse it.
>>> lines = "Start\tCol0\tCol1\tCol2\nRow0\t2.0\t1.2\t3.4\nRow1\t5.0\t6.2\t7.1\nRow2\t2.3\t5.6\t1.2\n"
>>> print lines
Start   Col0    Col1    Col2
Row0    2.0     1.2     3.4
Row1    5.0     6.2     7.1
Row2    2.3     5.6     1.2
>>> import StringIO
>>> from Bio import Cluster
>>> handle = StringIO.StringIO(lines)
>>> record = Cluster.read(handle)
>>> tree = record.treecluster()
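
If the data already lives in Python lists, the tab-delimited text can be generated rather than typed by hand. The following is only a sketch: the helper name to_cluster_text and its arguments are hypothetical and not part of Bio.Cluster.

```python
def to_cluster_text(row_names, column_names, values, corner="Start"):
    """Format in-memory data as tab-delimited text, one data row per line.

    corner is the label placed in the top-left cell of the header line.
    """
    header = "\t".join([corner] + list(column_names))
    body = []
    for name, row in zip(row_names, values):
        body.append("\t".join([name] + [str(value) for value in row]))
    return "\n".join([header] + body) + "\n"


lines = to_cluster_text(
    ["Row0", "Row1", "Row2"],
    ["Col0", "Col1", "Col2"],
    [[2.0, 1.2, 3.4], [5.0, 6.2, 7.1], [2.3, 5.6, 1.2]],
)
print(lines)
```

The resulting string can then be wrapped in a StringIO handle and passed to Cluster.read in the same way as the hand-written one.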

> How can I export the Tree object to something like the treeview format 
> mentioned in the documentation?

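Pass the tree to the save method of the record:
>>> record.save("myfilename", tree)
This writes myfilename.cdt and myfilename.gtr, which Java TreeView can open.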

> Is there any way to visualize the tree directly using ASCII or something 
> more graphical?

Currently, there is no ASCII-art-like representation to visualize the tree, so the easiest solution is to save the clustering solution in the TreeView format and use Java TreeView to visualize it.
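
If a rough text rendering is enough, one can be put together by hand in plain Python. The sketch below is not a Bio.Cluster feature: the function name outline is made up, and it assumes the tree is given as a list of (left, right) pairs using the node numbering of a Tree object.

```python
def outline(tree, labels):
    """Return the tree as indented text lines, starting from the last node.

    tree   -- list of (left, right) pairs; negative values refer to nodes.
    labels -- one label per row of the original data (illustrative names).
    """
    lines = []

    def visit(item, depth):
        indent = "  " * depth
        if item >= 0:
            # A non-negative value is a row number, i.e. a leaf.
            lines.append(indent + labels[item])
        else:
            # A negative value is a node; node -k is stored at index k - 1.
            lines.append(indent + "Node %d" % item)
            left, right = tree[-item - 1]
            visit(left, depth + 1)
            visit(right, depth + 1)

    # The last node in the list joins everything, so it is the root.
    visit(-len(tree), 0)
    return lines


tree = [(1, 2), (-1, 0)]
labels = ["Row0", "Row1", "Row2"]
print("\n".join(outline(tree, labels)))
```

Because the function is recursive, a very deep tree could exceed Python's recursion limit; for large data sets the TreeView route remains the practical choice.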

--Michiel.
