[Biopython] NJ tree constructor never completes

Peter Cock p.j.a.cock at googlemail.com
Mon Aug 14 15:42:03 UTC 2017


Directly CC'ing the original author, Yanbo Ye.

I wonder if we can improve performance by taking better
advantage of NumPy here? Specifically, should the distance
matrix be stored as an array, which would allow vectorized
calculations rather than Python for loops?

It looks like dm.matrix is just a Python list of lists...
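
As a rough sketch of what I mean (this is not Biopython's actual code): the
helper names lower_triangle_to_array and node_distances below are made up for
illustration, the four-taxon matrix is invented, and I am assuming the list of
lists is lower triangular with the diagonal included. The row sums from the
inner loop in the traceback could then become a single array operation:

```python
import numpy as np


def lower_triangle_to_array(rows):
    """Expand a lower-triangular list of lists into a full symmetric array.

    Assumes rows[i] holds i + 1 values: the distances from item i to
    items 0..i, with the diagonal entry last.
    """
    n = len(rows)
    full = np.zeros((n, n), dtype=float)
    for i, row in enumerate(rows):
        full[i, : i + 1] = row
    # Mirror the strict lower triangle into the upper one; the diagonal
    # is left untouched.
    return full + np.tril(full, k=-1).T


def node_distances(full):
    """Vectorized form of the loop in the traceback: sum each row,
    then divide by (n - 2)."""
    n = full.shape[0]
    return full.sum(axis=1) / (n - 2)


# Small made-up matrix for four taxa, in lower-triangular form.
rows = [
    [0],
    [5, 0],
    [9, 10, 0],
    [9, 10, 8, 0],
]
full = lower_triangle_to_array(rows)
node_dist = node_distances(full)
```

The conversion is still a Python loop, but it runs once per matrix rather than
once per lookup, and the summation itself no longer goes through __getitem__
and __len__ for every element.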


Peter

On Mon, Aug 14, 2017 at 4:31 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi Andrew,
>
> My guess is you are simply seeing the scaling with the number of
> entries being a problem (at least quadratic, and cubic for the
> textbook neighbor-joining algorithm). Can you try timing a series of
> subsets, say 10 entries, 50, 100, 200, 250, 500, 1000? That ought to
> be enough to estimate how long the full 6000 or so would take.
>
> It may be that you would be better off with a compiled command line
> phylogenetic tree tool for work at this scale?
>
> Peter
>
> On Mon, Aug 14, 2017 at 4:21 PM, Andrew Sanchez <aas229 at nau.edu> wrote:
>> I am trying to construct a tree from a DistanceMatrix object of length 6303 with the following command: `tree = constructor.nj(bio_dmx)`.
>>
>> The matrix and constructor were derived like so:
>>
>> bio_dmx = _DistanceMatrix(names, nested_dmx)
>> constructor = DistanceTreeConstructor()
>>
>> I've tested my workflow on a much smaller distance matrix, just following the examples at http://biopython.org/wiki/Phylo and it worked just fine.  When I try to do it with this larger dataset, the process just hangs.  I don't know where to begin debugging.  First of all, how long should I expect this process to take?  From Wikipedia:  "...typical run times proportional to approximately the square of the number of taxa."
>>
>> Maybe it is normal for a tree of this size to take so long to construct?  If so, is there a way to run tree = constructor.nj(bio_dmx) so that it produces some output that will allow me to at least see that something is happening?
>>
>> I was trying to do this in an IPython session, and eventually I just cancelled the process which had been going for about 48 hours.  The result of the keyboard interrupt was:
>>
>> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-packages/Bio/Phylo/TreeConstruction.py in nj(self, distance_matrix)
>>    697                 node_dist[i] = 0
>>    698                 for j in range(0, len(dm)):
>> --> 699                     node_dist[i] += dm[i, j]
>>    700                 node_dist[i] = node_dist[i] / (len(dm) - 2)
>>    701
>>
>> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-packages/Bio/Phylo/TreeConstruction.py in __getitem__(self, item)
>>    166                 raise TypeError("Invalid index type.")
>>    167             # check index
>> --> 168             if row_index > len(self) - 1 or col_index > len(self) - 1:
>>    169                 raise IndexError("Index out of range.")
>>    170             if row_index > col_index:
>>
>> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-packages/Bio/Phylo/TreeConstruction.py in __len__(self)
>>    284     def __len__(self):
>>    285         """Matrix length"""
>> --> 286         return len(self.names)
>>    287
>>    288     def __repr__(self):
>>
>> Does this output suggest that the job was in fact running just fine, but just taking a really long time?
>>
>> Is there any other info that would be helpful in figuring this out?
>>
>> Thank you,
>> Andrew


