[Biopython] deleting in-group paralogs from newick trees

Eric Talevich eric.talevich at gmail.com
Mon Aug 8 19:33:52 UTC 2011


On Mon, Aug 8, 2011 at 2:08 PM, Jessica Grant <jgrant at smith.edu> wrote:

> Hello,
>
> I am looking at large phylogenetic trees that have many paralogs.  I would
> like to simplify my trees so that all monophyletic paralog groups are
> collapsed--or all sequences except the shortest branch are deleted.  Is
> there a Biopython module that can help?  I started looking at Phylo, but
> couldn't see an obvious way.
>

Hi Jessica,

Yes, Phylo is the right module to use. If I understand your problem
correctly, the tree methods you want are is_monophyletic() and
collapse_all(). Both operate on a clade within the tree. You'd traverse the
tree with get_nonterminals(), check if a paralog group under a clade is
monophyletic, and if so, collapse it.

Do you have a list of paralogs already? And, do you know which groups might
be monophyletic?

If you have groups/clades already, it's simple:

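>>> from Bio import Phylo
>>> tree = Phylo.read('mytree.nwk', 'newick')
>>> # SOME_PARALOG_GROUP is a placeholder for one group of paralog tips
>>> for clade in tree.get_nonterminals(order='postorder'):
...     # is_monophyletic() returns the common ancestor clade, or False
...     mono_parent = clade.is_monophyletic(SOME_PARALOG_GROUP)
...     if mono_parent:
...         mono_parent.collapse_all()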
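
If you would rather delete everything except the shortest branch in a
monophyletic group, the selection step might look like the sketch below. It
is plain Python with no Biopython dependency; tips_to_delete is a
hypothetical helper, and the tip names and branch lengths are made up for
illustration. It assumes every tip has a branch length.

```python
def tips_to_delete(branch_lengths):
    """Return every tip name except the one with the shortest branch.

    branch_lengths maps tip name -> branch length (hypothetical input;
    with Phylo these would come from each terminal's .name and
    .branch_length attributes).
    """
    if not branch_lengths:
        return []
    # Ties are broken by name so the result is deterministic.
    keep = min(branch_lengths, key=lambda name: (branch_lengths[name], name))
    return sorted(name for name in branch_lengths if name != keep)


# Made-up paralog group with three tips.
group = {"geneA_1": 0.12, "geneA_2": 0.05, "geneA_3": 0.30}
doomed = tips_to_delete(group)
```

Each name in doomed could then be removed from the tree, for example with
tree.prune(); check that call against the Biopython version you have
installed.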

If you don't know the groups yet, then the test inside the loop is a little
more elaborate. You can look for overlaps between a clade's tips and the
paralog list using sets:

>>> paralogs = set(PARALOG_LIST)
# Inside the loop:
>>> tips = set(t.name for t in clade.get_terminals())
>>> overlap = tips.intersection(paralogs)
>>> if len(overlap) >= 2:
# The rest of the loop...
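
The overlap test can also be tried on its own before wiring it into the
loop. This is a sketch on plain lists of names, so it runs without a tree;
paralog_overlap, has_paralog_group and the sample names are hypothetical,
and the threshold of 2 follows the check in the snippet.

```python
def paralog_overlap(tip_names, paralogs):
    """Return the set of tip names that are also known paralogs."""
    return set(tip_names) & set(paralogs)


def has_paralog_group(tip_names, paralogs, min_size=2):
    """True if at least min_size of the clade's tips are paralogs."""
    return len(paralog_overlap(tip_names, paralogs)) >= min_size


# Made-up example data.
paralogs = {"geneA_1", "geneA_2", "geneB_1"}
clade_tips = ["geneA_1", "geneA_2", "outgroup"]
overlap = paralog_overlap(clade_tips, paralogs)
```

Inside the loop, tip_names would be the names of the terminals returned by
clade.get_terminals().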


Hope that helps,
Eric


