[Biopython] NJ tree constructor never completes

Eric Talevich eric.talevich at gmail.com
Tue Aug 15 20:31:38 UTC 2017


Hi Andrew,

If you have a DistanceMatrix object handy, you can export it to Phylip's
matrix format with the method "format_phylip":

dm = DistanceMatrix(...)
with open("my_file.phylip", "w") as f:
    dm.format_phyilp(f)

Then you can feed this file to another, faster program like Fasttree or
Phylip's "neighbor" program.

This module doesn't use Numpy internally because it was written at a time
when Biopython avoided having Numpy as a dependency. It would probably be a
useful improvement to port it to Numpy or Scipy now.

-Eric

On Tue, Aug 15, 2017 at 9:26 AM, Andrew Sanchez <aas229 at nau.edu> wrote:

> Here are the results.  The run time scales quickly, indeed!  It would take
> days to generate a tree for a 6000 x 6000 matrix.  I should have remembered
> this from my introduction to bioinformatics course.
>
> In [35]: for num in range(100, 1001, 100):
>     ...:     print('Matrix size: %s' % num)
>     ...:     %time gbf.estimate_tree_constructor_runtime(num)
>     ...:
> Matrix size: 100
> Wall time: 2.36 s
> Matrix size: 200
> Wall time: 18.2 s
> Matrix size: 300
> Wall time: 1min 1s
> Matrix size: 400
> Wall time: 2min 25s
> Matrix size: 500
> Wall time: 4min 43s
> Matrix size: 600
> Wall time: 8min 8s
> Matrix size: 700
> Wall time: 13min 1s
> Matrix size: 800
> Wall time: 19min 15s
> Matrix size: 900
> Wall time: 27min 32s
> Matrix size: 1000
> Wall time: 37min 50s
>
> > It may be you would be better off with a compiled command line
> > phylogenetic tree for work at this scale
>
> Thanks for the tip!  Would you kindly point me in the right direction to
> begin understand how I can accomplish this?
>
> Thank you,
> Andrew
>
> > On Aug 14, 2017, at 5:31 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> >
> > Hi Andrew,
> >
> > My guess is you are simply seeing the quadratic scaling being a
> > problem. Can you try timing a series of subsets, say 10 entries, 50,
> > 100, 200, 250, 500, 1000 - that approach ought to be enough to
> > estimate how long the full 6000 or so would take.
> >
> > It may be you would be better off with a compiled command line
> > phylogenetic tree for work at this scale?
> >
> > Peter
> >
> > On Mon, Aug 14, 2017 at 4:21 PM, Andrew Sanchez <aas229 at nau.edu> wrote:
> >> I am trying to construct a tree from a DistanceMatrix object with len
> of 6303 with the following command:  `tree = constructor.nj(bio_dmx)`.
> >>
> >> The matrix and constructor were derived like so:
> >>
> >> bio_dmx = _DistanceMatrix(names, nested_dmx)
> >> constructor = DistanceTreeConstructor()
> >>
> >> I've tested my workflow on a much smaller distance matrix, just
> following the examples at http://biopython.org/wiki/Phylo and it worked
> just fine.  When I try to do it with this larger dataset, the process just
> hangs.  I don't know where to begin debugging.  First of all, how long
> should I expect this process to take?  From wikipedia:  “...typical run
> times proportional to approximately the square of the number of taxa."
> >>
> >> Maybe it is normal for a tree of this size to take so long to
> construct?  If so, is there a way to run tree = constructor.nj(bio_dmx) so
> that it produces some output that will allow me to at least see that
> something is happening?
> >>
> >> I was trying to do this in an IPython session, and eventually I just
> cancelled the process which had been going for about 48 hours.  The result
> of the keyboard interrupt was:
> >>
> >> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-
> packages/Bio/Phylo/TreeConstruction.py in nj(self, distance_matrix)
> >>   697                 node_dist[i] = 0
> >>   698                 for j in range(0, len(dm)):
> >> --> 699                     node_dist[i] += dm[i, j]
> >>   700                 node_dist[i] = node_dist[i] / (len(dm) - 2)
> >>   701
> >>
> >> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-
> packages/Bio/Phylo/TreeConstruction.py in __getitem__(self, item)
> >>   166                 raise TypeError("Invalid index type.")
> >>   167             # check index
> >> --> 168             if row_index > len(self) - 1 or col_index >
> len(self) - 1:
> >>   169                 raise IndexError("Index out of range.")
> >>   170             if row_index > col_index:
> >>
> >> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-
> packages/Bio/Phylo/TreeConstruction.py in __len__(self)
> >>   284     def __len__(self):
> >>   285         """Matrix length"""
> >> --> 286         return len(self.names)
> >>   287
> >>   288     def __repr__(self):
> >>
> >> Does this output suggest that the job was in fact running just fine,
> but just taking a really long time?
> >>
> >> Is there any other info that would be helpful in figuring this out?
> >>
> >> Thank you,
> >> Andrew
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at mailman.open-bio.org
> >> http://mailman.open-bio.org/mailman/listinfo/biopython
>
>
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20170815/b838cd37/attachment-0001.html>


More information about the Biopython mailing list