[Biopython] Request from help

Thu Apr 11 13:20:14 EDT 2013

Hi Peter and Paulo,

Thank you for your feedback, much appreciated! I still know very little
about phylogenies, and especially about the run times needed to build
trees, so any pointers are welcome.

The sequences I'm using are full Influenza A HA protein sequences, so we're
talking about 1700-1750 amino acids being aligned together. The multiple
sequence alignment for 70 sequences doesn't take long - on the order of
minutes on my laptop. It's the "feeding into PhyML" portion that, for some
reason, takes a long time.

With that said, I do have a full distance matrix as one of the outputs from
a previous script in this script series, in addition to the multiple
sequence alignment. I have been able to feed the distance matrix into a
separate clustering algorithm from scikit-learn, and I was able to
successfully identify six clusters of sequences in there. Hence, I wanted
to use a phylogenetic tree to confirm what I'm seeing with the clustering
algorithm - it's basically two separate representations of the same data.

I have heard that it is possible to create a tree from the distance matrix,
and I was thinking this might be an alternative to feeding the alignment
into PhyML. Does anybody know how to do this using Biopython?
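
To make the question concrete, here is a minimal sketch of one distance
method, UPGMA, in plain Python. It is not a Biopython API: the function
name upgma_newick and the toy 4x4 matrix are illustrative assumptions, and
UPGMA itself assumes roughly clock-like distances, which may not hold for
influenza sequences (neighbor joining is the usual alternative). The
returned Newick string should be readable with Bio.Phylo.read(handle,
"newick") for drawing.

```python
def upgma_newick(names, dist):
    """Cluster taxa by UPGMA and return the tree as a Newick string.

    names: list of taxon labels
    dist:  full symmetric matrix (list of lists) of pairwise distances
    """
    # Each active cluster id maps to (newick_subtree, n_leaves, node_height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller_id, larger_id).
    d = {(i, j): float(dist[i][j])
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)

    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        sub_a, n_a, h_a = clusters.pop(a)
        sub_b, n_b, h_b = clusters.pop(b)
        height = dab / 2.0
        merged = "(%s:%.4f,%s:%.4f)" % (sub_a, height - h_a,
                                        sub_b, height - h_b)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the two merged clusters' distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        # Drop every entry that involved a or b, then add the new ones.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = (merged, n_a + n_b, height)
        next_id += 1

    newick = next(iter(clusters.values()))[0]
    return newick + ";"


# Toy data (hypothetical, not the HA distances).
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma_newick(names, matrix)
print(tree)
```

If the six scikit-learn clusters are real, they should show up as six
well-separated subtrees in a tree built this way from the same matrix.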

Cheers,
Eric

http://about.me/ericmjl

On Thu, Apr 11, 2013 at 1:11 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Apr 11, 2013 at 5:49 PM, Eric Ma <ericmajinglong at gmail.com> wrote:
> > Hello everybody,
> >
> > I'm new to the mailing list here, though I've been playing with Biopython
> > for quite a while.
> >
> > I'm having some trouble here. I wanted to display a tree of sequences for
> > which I had done a multiple sequence alignment. I tried going through the
> > pipeline example here (http://biopython.org/wiki/Phylo#Example_pipeline).
> > Because I'm still in the testing phase, instead of writing it as a single
> > script, I wrote it as a series of scripts that I would execute in order.
> >
> > The problem I run into is at step 4 in the example, where I "feed the
> > alignment to PhyML". My data set is 70 protein sequences, and the trouble
> > I run into is that it takes a very, very long time at the "feeding
> > alignment to PhyML" step. I tried running the script on my MacBook Pro
> > overnight, and even the next morning it was not done. Am I missing
> > something here?
> >
> > Just to be clear here, aligning the sequences using Muscle was successful,
> > and I also managed to output a distance matrix from sample to sample,
> > which I used in another downstream pipeline to display the clustering of
> > the sequences on a 2D Euclidean plane. However, I wanted to have a tree
> > representation to validate the clustering results; the trouble is, I
> > can't get the _phyml_tree.txt file to be created, which I would then use
> > to draw the tree.
> >
> > Thanks in advance for any help!
> >
> > Cheers,
> > Eric
>
> Hi Eric,
>
> So this part is getting stuck (or taking a very long time):
>
> #Feed the alignment to PhyML using the command line wrapper:
> from Bio.Phylo.Applications import PhymlCommandline
> cmdline = PhymlCommandline(input='egfr-family.phy', datatype='aa',
> model='WAG', alpha='e', bootstrap=100)
> out_log, err_log = cmdline()
>
> At that point is the computer active (high CPU load as measured
> via the task manager / system monitor / top / etc)?
>
> I would suggest trying PhyML at the command line by hand. First
> check the command that Biopython should be running:
>
> print(cmdline)
>
> That may give you visual progress on screen. My guess is simply
> that this is just slow - you are only running 100 bootstraps, but
> perhaps each one is taking a while and that adds up.
>
> You said you had 70 protein sequences - how many columns
> are there in the alignment? That can also affect run times.
>
> Peter
>
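
Peter's point that the bootstraps add up can be put in rough numbers.
Assuming bootstrap=100 makes PhyML repeat the tree search on 100 resampled
alignments on top of the original one, the total is roughly 101 times a
single run. The function name and the 10-minute figure below are made-up
placeholders, not measurements; timing one run with bootstrapping switched
off would give a real number to plug in.

```python
def estimate_total_hours(single_run_minutes, bootstraps):
    """Rough wall-clock estimate: one search on the original alignment
    plus one search per bootstrap replicate."""
    runs = 1 + bootstraps
    return runs * single_run_minutes / 60.0


# Hypothetical figure: assume a single tree search takes 10 minutes.
hours = estimate_total_hours(single_run_minutes=10, bootstraps=100)
print("about %.1f hours" % hours)
```

Under that assumption an overnight run would not be enough, which is
consistent with the job still running the next morning.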