[Biopython] NJ tree constructor never completes

Alexey Morozov alexeymorozov1991 at gmail.com
Wed Aug 16 07:32:24 UTC 2017


A year or two ago I have already written a numpy-based implementation of
distance matrix precisely because biopython's one was both slow and
inconvenient. It's available at
https://github.com/SynedraAcus/sampler/blob/master/reductor.py (see
`DistanceMatrix` class)

What's fixed:
- Sequence IDs can be given in any order. Those `dm[j,i] if j>=i else
dm[j,i]` calls are annoying, and that gets even uglier (and damn slow, see
next point) with str sequence IDs.
- Dict, not list, for the sequence ID/index mapping. The current
implementation stores names in a list and iterates over it 4 (four!) times
every single time __getitem__ is called with the string IDs.
- Numpy storage for a reduced memory consumption
- A method for creating submatrices given a list of sequence IDs (probably
pretty situational)
- Matrix factory, a separate class, takes FASTA file, not an alignment
(very situational, needs to be expanded)

It's not compatible with other Biopython modules that use distances (ie
Phylo, distance calculations), so I didn't send a pull request. If you ever
decide to overhaul this part of project, please feel free to use my code.
It's been licensed under CC-BY 4.0, but I can relicense it if you want.

The overhaul actually is necessary: it's all pretty slow with more than a
few tens of sequences and the API sometimes is counterintuitive. Maybe a
MatrixIO class and matrix subclasses would be necessary. But I personally
haven't even touched other modules and can't promise I'll get to it any
time soon.

2017-08-16 4:31 GMT+08:00 Eric Talevich <eric.talevich at gmail.com>:

> Hi Andrew,
>
> If you have a DistanceMatrix object handy, you can export it to Phylip's
> matrix format with the method "format_phylip":
>
> dm = DistanceMatrix(...)
> with open("my_file.phylip", "w") as f:
>     dm.format_phyilp(f)
>
> Then you can feed this file to another, faster program like Fasttree or
> Phylip's "neighbor" program.
>
> This module doesn't use Numpy internally because it was written at a time
> when Biopython avoided having Numpy as a dependency. It would probably be a
> useful improvement to port it to Numpy or Scipy now.
>
> -Eric
>
> On Tue, Aug 15, 2017 at 9:26 AM, Andrew Sanchez <aas229 at nau.edu> wrote:
>
>> Here are the results.  The run time scales quickly, indeed!  It would
>> take days to generate a tree for a 6000 x 6000 matrix.  I should have
>> remembered this from my introduction to bioinformatics course.
>>
>> In [35]: for num in range(100, 1001, 100):
>>     ...:     print('Matrix size: %s' % num)
>>     ...:     %time gbf.estimate_tree_constructor_runtime(num)
>>     ...:
>> Matrix size: 100
>> Wall time: 2.36 s
>> Matrix size: 200
>> Wall time: 18.2 s
>> Matrix size: 300
>> Wall time: 1min 1s
>> Matrix size: 400
>> Wall time: 2min 25s
>> Matrix size: 500
>> Wall time: 4min 43s
>> Matrix size: 600
>> Wall time: 8min 8s
>> Matrix size: 700
>> Wall time: 13min 1s
>> Matrix size: 800
>> Wall time: 19min 15s
>> Matrix size: 900
>> Wall time: 27min 32s
>> Matrix size: 1000
>> Wall time: 37min 50s
>>
>> > It may be you would be better off with a compiled command line
>> > phylogenetic tree for work at this scale
>>
>> Thanks for the tip!  Would you kindly point me in the right direction to
>> begin understand how I can accomplish this?
>>
>> Thank you,
>> Andrew
>>
>> > On Aug 14, 2017, at 5:31 PM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>> >
>> > Hi Andrew,
>> >
>> > My guess is you are simply seeing the quadratic scaling being a
>> > problem. Can you try timing a series of subsets, say 10 entries, 50,
>> > 100, 200, 250, 500, 1000 - that approach ought to be enough to
>> > estimate how long the full 6000 or so would take.
>> >
>> > It may be you would be better off with a compiled command line
>> > phylogenetic tree for work at this scale?
>> >
>> > Peter
>> >
>> > On Mon, Aug 14, 2017 at 4:21 PM, Andrew Sanchez <aas229 at nau.edu> wrote:
>> >> I am trying to construct a tree from a DistanceMatrix object with len
>> of 6303 with the following command:  `tree = constructor.nj(bio_dmx)`.
>> >>
>> >> The matrix and constructor were derived like so:
>> >>
>> >> bio_dmx = _DistanceMatrix(names, nested_dmx)
>> >> constructor = DistanceTreeConstructor()
>> >>
>> >> I've tested my workflow on a much smaller distance matrix, just
>> following the examples at http://biopython.org/wiki/Phylo and it worked
>> just fine.  When I try to do it with this larger dataset, the process just
>> hangs.  I don't know where to begin debugging.  First of all, how long
>> should I expect this process to take?  From wikipedia:  “...typical run
>> times proportional to approximately the square of the number of taxa."
>> >>
>> >> Maybe it is normal for a tree of this size to take so long to
>> construct?  If so, is there a way to run tree = constructor.nj(bio_dmx) so
>> that it produces some output that will allow me to at least see that
>> something is happening?
>> >>
>> >> I was trying to do this in an IPython session, and eventually I just
>> cancelled the process which had been going for about 48 hours.  The result
>> of the keyboard interrupt was:
>> >>
>> >> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-pack
>> ages/Bio/Phylo/TreeConstruction.py in nj(self, distance_matrix)
>> >>   697                 node_dist[i] = 0
>> >>   698                 for j in range(0, len(dm)):
>> >> --> 699                     node_dist[i] += dm[i, j]
>> >>   700                 node_dist[i] = node_dist[i] / (len(dm) - 2)
>> >>   701
>> >>
>> >> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-pack
>> ages/Bio/Phylo/TreeConstruction.py in __getitem__(self, item)
>> >>   166                 raise TypeError("Invalid index type.")
>> >>   167             # check index
>> >> --> 168             if row_index > len(self) - 1 or col_index >
>> len(self) - 1:
>> >>   169                 raise IndexError("Index out of range.")
>> >>   170             if row_index > col_index:
>> >>
>> >> /home/aas229/anaconda3/envs/gbfilter/lib/python3.4/site-pack
>> ages/Bio/Phylo/TreeConstruction.py in __len__(self)
>> >>   284     def __len__(self):
>> >>   285         """Matrix length"""
>> >> --> 286         return len(self.names)
>> >>   287
>> >>   288     def __repr__(self):
>> >>
>> >> Does this output suggest that the job was in fact running just fine,
>> but just taking a really long time?
>> >>
>> >> Is there any other info that would be helpful in figuring this out?
>> >>
>> >> Thank you,
>> >> Andrew
>> >> _______________________________________________
>> >> Biopython mailing list  -  Biopython at mailman.open-bio.org
>> >> http://mailman.open-bio.org/mailman/listinfo/biopython
>>
>>
>> _______________________________________________
>> Biopython mailing list  -  Biopython at mailman.open-bio.org
>> http://mailman.open-bio.org/mailman/listinfo/biopython
>>
>
>
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython
>



-- 
Alexey Morozov,
LIN SB RAS, bioinformatics group.
Irkutsk, Russia.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20170816/ae90e1a6/attachment.html>


More information about the Biopython mailing list