[Bioperl-l] Getting started with Bio::TreeIO

Jason Stajich jason@cgt.mc.duke.edu
Sun, 16 Jun 2002 22:30:37 -0400 (EDT)


Serendipity?  I have just started (this week) to refocus on my initial
efforts on the Tree modules to build more functionality as part of a push
to build evolutionary and comparative genomics tools in Bioperl.

More to details as to the exact plan when I get a little more in the thick
of it.  The first part that was committed last week - a wrapper around one
of the PAML programs (codeml) to calculate Ka,Ks,and Ka/Ks values.

On Mon, 17 Jun 2002, Howard Ross wrote:

> Dear folks,
> I would like to be able to read in a phylogenetic tree, manipulate its
> branching structure by repositioning clades, and then write it to file.
> The Bio::Tree modules seem the obvious tool. I'm have trouble getting
> started and would appreciate a bit of help.
>
> In working with the Bio::Tree modules recently, I have deduced:
> - they do not support Newick trees WITHOUT branch lengths,
> although this is a legitimate format, and

Good point - bug fix checked into main trunk tonight.  Can now read/write
newick w/ and w/o branch labels.  I also added a new made-up format that I
call Bio::TreeIO::tabtree which prints trees in an ASCII drawing of sorts.
Very simple until I hook up the drawgram connection to the PHYLIP wrapper
(which is on the TODO list).

Contact me off the list if you are uncomfortable with CVS checkouts of the
code - however this is going to have to be bleeding edge code for at least
the summer.

> - ancestor is defined only for rooted trees (this makes sense)
> Can I suggest that these get put into the POD sections, to save the
> next person's time.
>

Sure thing.  We should talk more about what should be expected behaviors
for different types of trees and what expanded list of questions we want
to address with trees.

> I have successfully read in a tree, and then written it out to file.
> But, I have been totally unsuccessful at making any changes to the
> tree and then saving it to file.
>

I think that I ran into a bit of trouble implementing this and probably
never fixed it - spent a little time on it tonight, see below for
comments.  Thank you for a REALLY GOOD BUG REPORT since it is easy for me
to get in and start playing with the bugs.

> The following piece of code should swap nodes C and G in the tree
> (shown below) but it only recreates the input tree.
> #===============================================================
> #!/usr/bin/perl
>
> use Bio::TreeIO;
>
> my $treeio = new Bio::TreeIO(-format => 'newick', -file =>
> 'rooted8tree.tre');
> my $treeout = new Bio::TreeIO(-format => 'newick', -file => ">saveme.txt" );
> my $tree = $treeio->next_tree;
> my @nodes = $tree->get_nodes;
>

> my $i, $c, $g;
>
> for ($i = 0; $i <= $#nodes; $i++) {
>     if ($nodes[$i]->id eq 'C') {
>     $c = $i;
>     }
>     if ($nodes[$i]->id eq 'G') {
>     $g = $i;
>     }
> }
> my $cancestor = $nodes[$c]->ancestor;
> my $gancestor = $nodes[$g]->ancestor;
> $nodes[$c]->ancestor($gancestor);
> $nodes[$g]->ancestor($cancestor);
>
> $treeout->write_tree($tree);


The ancestor updates don't fix the fact that we have a reverse pointer for
the parent->child relationship which is what is read when the tree is
dumped out for the Tree writer - you have to update the child list of the
ancestor node to reflect the changes you want - the ancestor method is
basically a convience method which is kept up to date by the
add_Descendent method, I should probably make it read-only to prevent
these problems in the future (okay -its done now).

Perhaps I may have implemented the tree and node objects as more
complicated than they need to be?  I think it is justified given the
different capabilities we want.  If someone wants to suggest a better
implementation please code it up and propose it.

Here are the capabilities I think that Trees and their Nodes need:

 * Represent Polytomys (>2 children per node).

   So we use a hash at each Node to represent the pointers to all of
   its children.  The has is updated through the methods
   add_Descendent().  To retrieve all the immediete children
   call each_Descendent().  To get all the children and all of their
   children (a recursive call), use get_Descedents().

 * Be able to get the ancestor (parent) of a child node.

   The add_Descendent method also does some housekeeping for us - it
   updates the ancestor pointer.  This is accessed through the
   ancestor() method.

 * Store branch length, description text, and a user readable id.
   Additionally have an associated unique id for a node.

 * Remove a child from one node and add it as a child to another node.
   (As brought up by Howard).  I added the methods
   remove_all_Descendents() which wipes out a node's list of children
   (but does not necessarily DESTROY them if you have a ptr handle to
   them elsewhere).  And the method remove_Descendent($node) which
   will remove the requested child from a node.  (This is based on the
   the internal_id so implementing a db store for trees will need to
   be sure and properly store an id for Nodes).

So to do what you want to do - which is move a node to a different clade -
we need to do a couple of things - remove it from the list of children for
the parent node of that clade, and add it as a child to the head node of
the second clade you want it in.  I have added your code in the t/TreeIO.t
test by reading in your example file and providing the tests.

hmm - the anonymous server is down temporarily or I'd post the link to the
code, we'll get that back up soon.  Anyways, it basically looks like this:
(Set the $vebose flag to 1 if you want to see the tree printed out).

$treeio = new Bio::TreeIO(-format => 'newick',
                          -fh => \*DATA);
my $treeout = new Bio::TreeIO(-format => 'tabtree');
my $treeout2 = new Bio::TreeIO(-format => 'newick');

$tree = $treeio->next_tree;

if( $verbose > 0  ) {
    $treeout->write_tree($tree);
    $treeout2->write_tree($tree);
}
@nodes = $tree->get_nodes;

my( $i, $c, $g);

for ($i = 0; $i <= $#nodes; $i++) {
    next unless defined $nodes[$i]->id;
    if ($nodes[$i]->id eq 'C') {
        $c = $i;
    }
    if ($nodes[$i]->id eq 'G') {
        $g = $i;
    }
}
$nodes[$c]->ancestor;
$nodes[$g]->ancestor;
my $cancestor = $nodes[$c]->ancestor;
my $gancestor = $nodes[$g]->ancestor;
$cancestor->id('C-ancestor'); # let's provide a way to test if we suceeded
$gancestor->id('G-ancestor'); # in our swapping

$cancestor->remove_Descendent($nodes[$c]);
$gancestor->remove_Descendent($nodes[$g]);
$cancestor->add_Descendent($nodes[$g],1);
$gancestor->add_Descendent($nodes[$c],1);

@nodes = $tree->get_nodes();

for ($i = 0; $i <= $#nodes; $i++) {
    next unless defined $nodes[$i]->id;
    if ($nodes[$i]->id eq 'C') {
        # test that we swapped nodes properly
        ok($nodes[$i]->ancestor->id, 'G-ancestor');
        $c = $i;
    }
    if ($nodes[$i]->id eq 'G') {
        $g = $i;
        # test that we swapped nodes properly
        ok($nodes[$i]->ancestor->id, 'C-ancestor');
    }
}

if( $verbose > 0  ) {
    $treeout2->write_tree($tree);
}

__DATA__
(((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1);



> #==============================================================
> -rooted8tree.tre-
> (((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1);
>
> I have also tried to rearrange the tree by adding nodes and defining
> the descendents, again without success.
>
> So, the big question is how do you rearrange nodes in these trees:
>
> - To specify a node-node relationship, do you need to specify both the
>   ancestor and the descendent fields of the respective nodes?

Just need to update the descendents as described above as the ancestor ptr
is updated by the add_Descendent method
>
> - Can (should) you have more than one tree in memory?
>
sure, there is no problem with that - if you are basing one tree off the
nodes of another you will need to make explicit copies of each node - we
probably need to think about a 'clone' method or something so that you
don't get into trouble making updates in one list of nodes which affects
the other list as well.

> - How do you specify which tree a particular node belongs to?
>
I think a tree is just a grouping of nodes - if you wanted to test if two
nodes are in the same tree - you would want to walkup the ancestor
pointers until you got to the root node and compare the internal_ids for
each of those nodes.


> - Do all nodes in a particular tree have to be defined in the same array?
>
no - you can add nodes after a tree is built - the nodes array is a
convient way to manipulate a tree when you know it is binary (see the
RandomTree generator in Bio::Tree).

> - Given the situation with ancestor and unrooted trees, how do you
>   (re)define node-node relationships for this type of tree?

hmm - so the definition of an unrooted tree is on which would have > 1
nodes with no ancestor.   It is quite fine to do the rearranging as I
describe above - the problems come when you want to test if two nodes are
in the same tree.  I'm quite sure this is covered in the CS literature -
I'll leaf through my Gusfield text and see what he covers wrt to
phylogenetic trees.


Hope this is the beginning of a good answer - I'll try and convert this
commentary into better documentation as time permits (or if someone else
wants to contribute this I'll gladly step aside!).


-j

>
> Your help would be much appreciated.
>
> Howard
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu