[Biopython-dev] Bio.Phylo to_adjacency_matrix function

Peter biopython at maubp.freeserve.co.uk
Tue Jan 19 10:47:39 EST 2010


On Tue, Jan 19, 2010 at 3:22 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> On Tue, Jan 19, 2010 at 5:49 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi Eric (and everyone else),
>>
>> I just spotted the to_adjacency_matrix function in utils:
>> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py
>>
>> The dostring says:
>>
>>> Create an adjacency matrix (NumPy array) from clades/branches in tree.
>>  >
>>> Also returns a list of all clades in tree ("allclades"), where the position
>>> of each clade in the list corresponds to a row and column of the numpy
>>> array. So, a cell i,j in the array represents the length of the branch from
>>> allclades[i] to allclades[j].
>>>
>>> @return: tuple of (allclades, adjacency_matrix) where allclades is a list
>>> and adjacency_matrix is a NumPy 2D array.
>>
>> It looks like your adjacency matrix starts as a numpy array of zeros,
>> and then you sets some edges to branch lengths. How do you tell
>> apart a non-connection and a real connection of length zero? These
>> do occur, for example if you have three identical sequences, then
>> you might expect a single node with three children. However IIRC,
>> in (some) NJ trees each node has two children by construction,
>> so you get an extra node connected with a branch of length zero.
>
> Shoot, you're right. I can think of three reasonable mitigations:
> (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate
> adjacency -- this seems more standard in textbooks, actually.
> (b) Issue a warning or raise an error if the given tree contains a
> 0-length branch.
> (c) Delete the function.
>
> Which do you recommend?
>
> The idea was to give mathematicians something to play with. For
> example, Chapter 2 of this report represents phylogenies this way,
> using 0 or 1 to indicate the presence of a branch:
> http://www.metaheuristics.net/~mdorigo/HomePageDorigo/thesis/dea/CatanzaroDEA.pdf
>
> Thanks for the heads-up,
> Eric

I did wonder about further options,

(d) Since the distances are floats, we can use a NA as
a flag for no connection. However, this does not seem
very useful.

(e) Collapse nodes separated by a zero length branch
while building the adjacency matrix.

Or, raise an error (b) but provide a tree method to collapse
nodes separated by a zero length branch which could be
called to "clean up" a problematic tree before making the
adjacency matrix.

None of these options seem ideal :(

I would say the boolean matrix (a) is safe but is of limited utility.
Therefore (c), remove the function for now is probably best. It
can always be re-added in a later release if a good solution is
agreed.

Peter

P.S. Another potentially interesting thing would be a matrix using
the bootstrap support values (where again you have a problem
with zero bootstrap support vs no connection). I'm not sure if this
has any practical uses though.



More information about the Biopython-dev mailing list