[Biopython] NJ tree constructor never completes

Andrew Sanchez aas229 at nau.edu
Tue Aug 15 16:26:35 UTC 2017


Here are the results.  The run time scales quickly, indeed!  It would take days to generate a tree for a 6000 x 6000 matrix.  I should have remembered this from my introduction to bioinformatics course.

In [35]: for num in range(100, 1001, 100):
    ...:     print('Matrix size: %s' % num)
    ...:     %time gbf.estimate_tree_constructor_runtime(num)
    ...:
Matrix size: 100
Wall time: 2.36 s
Matrix size: 200
Wall time: 18.2 s
Matrix size: 300
Wall time: 1min 1s
Matrix size: 400
Wall time: 2min 25s
Matrix size: 500
Wall time: 4min 43s
Matrix size: 600
Wall time: 8min 8s
Matrix size: 700
Wall time: 13min 1s
Matrix size: 800
Wall time: 19min 15s
Matrix size: 900
Wall time: 27min 32s
Matrix size: 1000
Wall time: 37min 50s

> It may be you would be better off with a compiled command line
> phylogenetic tree for work at this scale

Thanks for the tip!  Would you kindly point me in the right direction to begin understand how I can accomplish this?

Thank you,
Andrew

> On Aug 14, 2017, at 5:31 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
> Hi Andrew,
> 
> My guess is you are simply seeing the quadratic scaling being a
> problem. Can you try timing a series of subsets, say 10 entries, 50,
> 100, 200, 250, 500, 1000 - that approach ought to be enough to
> estimate how long the full 6000 or so would take.
> 
> It may be you would be better off with a compiled command line
> phylogenetic tree for work at this scale?
> 
> Peter
> 
> On Mon, Aug 14, 2017 at 4:21 PM, Andrew Sanchez <aas229 at nau.edu> wrote:
>> I am trying to construct a tree from a DistanceMatrix object with len of 6303 with the following command:  `tree = constructor.nj(bio_dmx)`.
>> 
>> The matrix and constructor were derived like so:
>> 
>> bio_dmx = _DistanceMatrix(names, nested_dmx)
>> constructor = DistanceTreeConstructor()
>> 
>> I've tested my workflow on a much smaller distance matrix, just following the examples at http://biopython.org/wiki/Phylo and it worked just fine.  When I try to do it with this larger dataset, the process just hangs.  I don't know where to begin debugging.  First of all, how long should I expect this process to take?  From wikipedia:  “...typical run times proportional to approximately the square of the number of taxa."
>> 
>> Maybe it is normal for a tree of this size to take so long to construct?  If so, is there a way to run tree = constructor.nj(bio_dmx) so that it produces some output that will allow me to at least see that something is happening?
>> 
>> I was trying to do this in an IPython session, and eventually I just cancelled the process which had been going for about 48 hours.  The result of the keyboard interrupt was:
>> 
>> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-packages/Bio/Phylo/TreeConstruction.py in nj(self, distance_matrix)
>>   697                 node_dist[i] = 0
>>   698                 for j in range(0, len(dm)):
>> --> 699                     node_dist[i] += dm[i, j]
>>   700                 node_dist[i] = node_dist[i] / (len(dm) - 2)
>>   701
>> 
>> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-packages/Bio/Phylo/TreeConstruction.py in __getitem__(self, item)
>>   166                 raise TypeError("Invalid index type.")
>>   167             # check index
>> --> 168             if row_index > len(self) - 1 or col_index > len(self) - 1:
>>   169                 raise IndexError("Index out of range.")
>>   170             if row_index > col_index:
>> 
>> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-packages/Bio/Phylo/TreeConstruction.py in __len__(self)
>>   284     def __len__(self):
>>   285         """Matrix length"""
>> --> 286         return len(self.names)
>>   287
>>   288     def __repr__(self):
>> 
>> Does this output suggest that the job was in fact running just fine, but just taking a really long time?
>> 
>> Is there any other info that would be helpful in figuring this out?
>> 
>> Thank you,
>> Andrew
>> _______________________________________________
>> Biopython mailing list  -  Biopython at mailman.open-bio.org
>> http://mailman.open-bio.org/mailman/listinfo/biopython




More information about the Biopython mailing list