[Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
Gabriel Valiente
valiente at lsi.upc.edu
Fri Jul 28 16:35:20 UTC 2006
>> It is a general algorithm I devised that takes a set of paths and
>> builds up a tree. The input paths are full lineages coming from
>> Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why
>> I said I don't know exactly where it would belong, it looks to me
>> more like a standalone script than a Bio::Taxonomy or Bio::Tree
>> method.
>>
>> Gabriel
>
> Agreed. You could submit the script as an example here if it is
> short, or
> via Bugzilla as an enhancement request:
>
> http://bugzilla.open-bio.org/
>
> It could be added to the scripts\ or examples\ directory in bioperl-
> core.
Here it is. Please check it and include for instance as
taxonomy2tree.PLS in the scripts/tree or scripts/taxonomy directory.
Disclaimer: I'm also publishing part of this code in a conference paper.
The script is already fully functional but anyway, I have a couple of
improvements in mind. The minor one is provision for cmdline input.
How would you like to input an array of names?
The other one is to remove internal node labels and contract
elementary paths, for instance reducing the tree:
(((((((((((((((((((((((((((("Pongo pygmaeus")Pongo,(("Gorilla
gorilla")Gorilla,("Pan troglodytes")Pan,("Homo sapiens")Homo)"Homo/
Pan/Gorilla group")Hominidae)Hominoidea)Catarrhini)Simiiformes)
Primates)Euarchontoglires)Eutheria)Theria)Mammalia)Amniota)Tetrapoda)
Sarcopterygii)Euteleostomi)Teleostomi)"Gnathostomata <vertebrate>")
Vertebrata)"Craniata <chordata>")Chordata)Deuterostomia)Coelomata)
Bilateria)Eumetazoa)Metazoa)"Fungi/Metazoa group")Eukaryota)"cellular
organisms")root;
to the tree:
("Pongo pygmaeus",("Gorilla gorilla","Pan troglodytes","Homo sapiens"));
It is certainly easy to remove all internal node labels. On the other
hand, I've been working on contraction of elementary paths for quite
a while, but always got stuck with internals of the Bio::Tree methods
to remove nodes. Thus, so far the only working code I have consists
of removing elementary branches while making a deep copy of the tree,
which certainly is not quite elegant...
Thanks a lot,
Gabriel
#!/usr/bin/perl -w
# Author: Gabriel Valiente <valiente at lsi.upc.edu>
# Purpose: Bio::DB::Taxonomy -> Bio::Tree::Tree
use strict;
use Bio::DB::Taxonomy;
use Bio::TreeIO;
my $nodesfile = "nodes.dmp";
my $namesfile = "names.dmp";
my $db = new Bio::DB::Taxonomy(-source => 'flatfile',
-directory => "./db/",
-nodesfile => $nodesfile,
-namesfile => $namesfile);
# the input to the script is an array of species names
my @species = ('Orangutan',
'Gorilla',
'Chimpanzee',
'Human');
my $root = new Bio::Tree::Node(-id => "root");
my $tree = new Bio::Tree::Tree(-root => $root);
# the full lineages of the species are merged into a tree
for my $name (@species) {
my $ncbi_id = $db->get_taxonid($name);
if ($ncbi_id) {
my $node = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
my @lineage = get_lineage_nodes($node);
shift @lineage; # discard root
push @lineage, $node;
merge_path($root, \@lineage);
} else {
warn "no NCBI Taxonomy node for species ",$name,"\n";
}
}
# the tree is output in Newick format
my $output = new Bio::TreeIO(-format => 'newick');
$output->write_tree($tree);
# the actual merging of full lineages is performed by a recursive method
sub merge_path {
my $root = shift;
my $path = shift;
my @path = @{$path};
if (@path) {
my $top = shift @path;
my @children = grep { $_->id eq $top->node_name } $root-
>each_Descendent;
if (@children) {
# $root has a $child with id eq $top name
my $child = shift @children;
merge_path($child,\@path);
} else {
# add $top and @path below $root
my $node = $root;
unshift @path, $top;
while (@path) {
my $top = shift @path;
my $name = $top->node_name;
my $child = new Bio::Tree::Node(-id => "$name");
$node->add_Descendent($child);
$node = $child;
}
}
}
}
# the full lineage of a species is recovered by traversing the taxonomy
sub get_lineage_nodes{
my $node = shift;
my @lineage;
while ($node->node_name ne "root") {
$node = $node->get_Parent_Node;
unshift @lineage, $node;
}
return @lineage;
}
=head1 NAME
taxonomy2tree - builds a taxonomic tree based on the full lineages of
a set of species names
=head1 DESCRIPTION
This script requires that the bioperl-run pkg be also installed.
Providing the nodes.dmp and names.dmp files from the NCBI Taxonomy
dump (see Bio::DB::Taxonomy::flatfile for more info) is only necessary
on the first time running. This will create the local indexes and may
take quite a long time. However once created, these indexes will
allow fast access for species to taxon id OR taxon id to species name
lookups.
=cut
More information about the Bioperl-l
mailing list