[Biopython] Request from help

Thu Apr 11 17:33:05 UTC 2013

On 2013-04-11, at 1:20 PM, Eric Ma <ericmajinglong at gmail.com> wrote:

> Hi Peter and Paulo,
> 
> Thank you for your feedback, much appreciated! I still have very sparse
> knowledge about phylogenies, and especially the run times needed to build
> the trees, so any new knowledge is appreciated!
> 
> The sequences I'm using are full Influenza A HA protein sequences, so we're
> talking about 1700-1750 amino acids being aligned together. The multiple
> sequence alignment for 70 sequences doesn't take long - on the order of
> minutes on my laptop. It's the "feeding into PhyML" portion that, for some
> reason, takes a long time.

Alignment time is much smaller than the time for any phylogeny calculation on a data set of your size. The number of amino acids matters little for the final run time, as the ML calculation itself is quite fast; searching over the possible branch arrangements (tree topologies) is the main bottleneck.
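
To give a feel for why arranging the branches dominates, here is a small sketch that counts the distinct unrooted, fully bifurcating topologies for n sequences, which is (2n-5)!!. The function name is made up for illustration. PhyML does not enumerate all of these (it uses a heuristic search), so the count only illustrates how large the search space is, not how many trees are actually evaluated.

```python
def unrooted_topologies(n_taxa):
    """Number of unrooted binary tree topologies for n_taxa labelled leaves.

    This is the double factorial (2n-5)!! = 3 * 5 * ... * (2n-5).
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n_taxa == 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 70):
    total = unrooted_topologies(n)
    # Print the number of decimal digits rather than the full value.
    print("%d taxa: %d digits in the topology count" % (n, len(str(total))))
```

Going from 10 to 70 sequences does not make the search 7 times bigger; the count grows faster than exponentially, which is why tree searches are slow even when each likelihood evaluation is fast.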

There's no easy solution for this. You could try other approaches: some that won't be as good as ML (Neighbour Joining), and some that might be as good (Bayesian) but take some time too.
> 
> With that said, I do have a full distance matrix as one of the outputs from
> a previous script in this script series, in addition to the multiple
> sequence alignment. I have been able to feed the distance matrix into a
> separate clustering algorithm from scikit-learn, and I was able to
> successfully identify six clusters of sequences in there. Hence, I wanted
> to use a phylogenetic tree to confirm what I'm seeing with the clustering
> algorithm - it's basically two separate representations of the same data.
> 

The distance matrix can be used to generate a diagram. I wouldn't call it a phylogenetic tree, but it can give you some ideas. One quick way to check your tree is the Neighbour Joining approach: you can try MEGA with your alignment file, and the calculations will be faster.
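
As a rough illustration of what Neighbour Joining does with a distance matrix, here is a minimal pure-Python sketch that returns a Newick string. It is not a Biopython API: the function name, the toy 5x5 matrix and the labels are all made up for the example, and it has not been tuned for speed or for ties in the matrix.

```python
def neighbor_joining(names, dist):
    """Build a tree by neighbour joining and return it as a Newick string.

    names: list of leaf labels.
    dist:  full symmetric matrix (list of lists) in the same order as names.
    """
    nodes = list(names)
    # Distances keyed by node label, so merged nodes can be added freely.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums over the nodes that are still active.
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Pick the pair (a, b) that minimises Q = (n-2)*d(a,b) - r(a) - r(b).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from a and b to their new parent node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2.0 * (n - 2))
        len_b = d[a][b] - len_a
        new = "(%s:%.4f,%s:%.4f)" % (a, len_a, b, len_b)
        # Distances from the new node to every remaining node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    # Two nodes remain; join them, splitting the last branch at its midpoint.
    # The tree is unrooted, so this root position is arbitrary.
    a, b = nodes
    half = d[a][b] / 2.0
    return "(%s:%.4f,%s:%.4f);" % (a, half, b, half)


# Toy distance matrix (made up for illustration, not real HA data).
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
newick = neighbor_joining(names, matrix)
print(newick)
```

If Biopython is installed, the resulting string should be readable with Phylo.read(StringIO(newick), "newick") and viewable with Phylo.draw_ascii. Newer Biopython releases than the one current at the time of this thread also include a Bio.Phylo.TreeConstruction module with a DistanceTreeConstructor, which is likely a better choice than hand-rolled code where it is available.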

Cheers
Paulo

> I have heard that it is possible to create a tree from the distance matrix,
> and I was thinking this might be an alternative to feeding the alignment
> into PhyML. Does anybody know how to do this using BioPython?
> 
> Cheers,
> Eric
> 
> http://about.me/ericmjl
> 
> 
> On Thu, Apr 11, 2013 at 1:11 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
>> On Thu, Apr 11, 2013 at 5:49 PM, Eric Ma <ericmajinglong at gmail.com> wrote:
>>> Hello everybody,
>>> 
>>> I'm new to the mailing list here, though I've been playing with BioPython
>>> for quite a while.
>>> 
>>> I'm having some trouble here. I wanted to display a tree of sequences for
>>> which I had done a multiple sequence alignment. I tried going through the
>>> pipeline example here (http://biopython.org/wiki/Phylo#Example_pipeline).
>>> Because I'm still in the testing phase, instead of writing it as a single
>>> script, I wrote it as a series of scripts that I would execute in order.
>>> 
>>> The problem I run into is at step 4 in the example, where I "feed the
>>> alignment to PhyML". My data set is 70 protein sequences, and the trouble I
>>> run into is that it takes a very, very long time at the "feeding alignment
>>> to PhyML" step. I tried running the script on my MacBook Pro overnight, and
>>> even the next morning it was not done. Am I missing something here?
>>> 
>>> Just to be clear here, aligning the sequences using Muscle was successful,
>>> and I also managed to output a distance matrix from sample to sample, which
>>> I used in another downstream pipeline to display the clustering of the
>>> sequences on a 2D Euclidean plane. However, I wanted to have a tree
>>> representation to validate the clustering results; the trouble is, I can't
>>> get the _phyml_tree.txt file to be created, which I would then use to draw
>>> the tree.
>>> 
>>> Thanks in advance for any help!
>>> 
>>> Cheers,
>>> Eric
>> 
>> Hi Eric,
>> 
>> So this part is getting stuck (or taking a very long time):
>> 
>> # Feed the alignment to PhyML using the command line wrapper:
>> from Bio.Phylo.Applications import PhymlCommandline
>> cmdline = PhymlCommandline(input='egfr-family.phy', datatype='aa',
>>                            model='WAG', alpha='e', bootstrap=100)
>> out_log, err_log = cmdline()
>> 
>> At that point is the computer active (high CPU load as measured
>> via the task manager / system monitor / top / etc)?
>> 
>> I would suggest trying PhyML at the command line by hand. First
>> check the command that Biopython should be running:
>> 
>> print(cmdline)
>> 
>> That may give you visual progress on screen. My guess is simply
>> that this is just slow - you are only running 100 bootstraps, but
>> perhaps each one is taking a while and that adds up.
>> 
>> You said you had 70 protein sequences - how many columns
>> are there in the alignment? That can also affect run times.
>> 
>> Peter
>> 
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython