[Bioperl-l] taxonomy and species

Thu Aug 28 08:49:12 EDT 2003

Juguang,

It sounds like you've formulated a solution, but let me describe another
approach, if only to plug BioSQL for those who haven't installed it. With a
BioSQL database installed, one can run Aaron's load_taxonomy.pl (found in
the biosql package), which loads the current taxonomy data from NCBI. There
you'll find each taxon labeled by name ("Arabidopsis"), node_rank ("genus"),
and parent_taxon_id. Yes, this approach is a bit more "mechanical" than
yours, but a straightforward script can get either the "full path" or the
children from the database. Sidelight: see
http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html for Aaron's
nice article on the meaning of the right_value and left_value fields.
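
To give a rough idea of how "mechanical" it is, here is a minimal sketch. It
assumes the rows have already been fetched into an in-memory hash (a real
script would select them from the database); the %taxon hash, its made-up
ids, and the subs full_path and children are hypothetical names used only
for illustration.

```perl
use strict;
use warnings;

# Hypothetical rows, as if selected from the taxon table:
# id => [ name, node_rank, parent_taxon_id ]. The ids are made up.
my %taxon = (
    1 => [ 'Murinae',                 'subfamily',  undef ],
    2 => [ 'Mus',                     'genus',      1 ],
    3 => [ 'Mus musculus',            'species',    2 ],
    4 => [ 'Mus musculus castaneus',  'subspecies', 3 ],
    5 => [ 'Mus musculus domesticus', 'subspecies', 3 ],
);

# Walk parent_taxon_id upward to build the full path, topmost taxon first.
sub full_path {
    my ($id) = @_;
    my @path;
    while ( defined $id ) {
        unshift @path, $taxon{$id}[0];
        $id = $taxon{$id}[2];
    }
    return @path;
}

# Direct children: every row whose parent_taxon_id equals $id.
sub children {
    my ($id) = @_;
    return sort { $a <=> $b }
        grep { defined $taxon{$_}[2] && $taxon{$_}[2] == $id } keys %taxon;
}

print join( '; ', full_path(3) ), "\n";
print join( ', ', map { $taxon{$_}[0] } children(3) ), "\n";
```

The same walk could be done with one query per level against the database,
or in a single query using the left_value and right_value fields described
in the article above.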

If you do write the code you've suggested, please send the final script; it
sounds like a good one for our examples/ directory.

Brian O.

-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Juguang Xiao
Sent: Thursday, August 28, 2003 5:55 AM
To: bioperl-l at bioperl.org
Subject: [Bioperl-l] taxonomy and species

Hi guys,

I tried to write a simple bioperl-db script that works like the
search on http://www.ncbi.nih.gov/Taxonomy/taxonomyhome.html/ and
returns the full taxonomy path and all sub-taxonomy nodes. Say, if I
search for 'mouse', it will return the full path as

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus; mouse

All sub-taxonomy nodes will also be returned, such as 'asian house
mouse', 'european house mouse', etc.

However, the guru Hilmar told me that the current bioperl-db works with
Bio::Species but not Bio::Taxonomy, so bioperl-db cannot satisfy the
requirement above until its code is adapted to Taxonomy, after Taxonomy
replaces Species. Hence I investigated the species-related modules,
found some puzzles, and would like to volunteer an idea and the code.

Bio::Taxonomy was written by Dan Kortschak, and its main and only
functional method (as opposed to get/set, I mean), 'classify', converts
a Species object into an array of names. It wastes such a nice
module name ;-)

Jason wrote Bio::Taxonomy::Node and Bio::DB::Taxonomy, which accesses
NCBI Entrez over HTTP or reads the NCBI Tax dump files.
Bio::Taxonomy::Node is tied closely to Bio::DB::Taxonomy, hence it
cannot be adapted into the bioperl-db system so easily.

My plan to reform them is described below.

DATA STRUCTURE
Taxonomy should be abstracted as a hash with the keys as rank names,
such as 'class', 'genus', and values as the identifiers, such as NCBI
taxid, scientific name or Taxonomy::Node object.

$taxonomy = {
        # copied from the current Taxonomy module.
        '_rank' => ['root', 'superkingdom', ..., 'species', 'subspecies'...,
                    'no rank'],

        # Though the keys are unordered in this hash, their order is
        # defined in '_rank'.
        '_hierarchy' => {
                ...
                'class' => 40674, # or 'Mammalia', or the Taxonomy::Node
                'genus' => 'Mus',
                'species' => $tax_Node_musculus
                ....
        },
        '_factory' => $factory, # explained later.
};

NOTE: the new taxonomy can represent more than the species level, e.g. it
can represent an object at genus level without a species.

$taxNode_mammalia = {
        # NCBI taxid; called 'object_id' for consistency with
        # Bio::IdentifiableI
        'object_id' => 40674,
        'rank' => 'class',
        'name' => 'Mammalia', # scientific name
        # GenBank common name, as the NCBI site uses the term.
        'common_name' => 'mammals',
        # a hash with name_class as key and variant name as value
        'alias' => {
                '' => ''
        },
        '_factory' => $factory
};

$taxNode_mouse = {
        'object_id' => 10090,
        'rank' => 'species',
        'names' => { # This is a general solution!!
                'specific' => ['musculus'],
                'common' => ['mouse', 'Mickey'],
                'includes' => ['nude mice']
        }
};

OBJECTS

Bio::Taxonomy will override all methods in Bio::Species, for the sake
of backwards compatibility. If the tax object represents a level higher
than species, the sub 'binomial' returns undef; otherwise it simply builds
the result by combining the genus and species. The sub 'classification'
will look like:

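# @ranks is the ordered rank list, i.e. @{ $taxonomy->{'_rank'} }
foreach my $rank (@ranks) {
        # unshift puts the most specific rank first, as Bio::Species does
        unshift @classification, $taxonomy->{'_hierarchy'}{$rank}
                if exists $taxonomy->{'_hierarchy'}{$rank};
}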

Bio::Taxonomy::Node has NO reference to either the parent node or the
taxonomy object, so that Node objects can be freely shared among
Taxonomy objects. Tricky: once a Node object is created, its content
should not be changed. If a Taxonomy requires one of its Nodes to be
modified, it has to make a new Node, in case that Node is shared by
another Taxonomy.
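
A minimal sketch of this sharing rule, assuming a stand-in class: the package
name My::TaxNode and the method modified_copy are hypothetical, chosen only to
illustrate the idea, and are not part of BioPerl.

```perl
use strict;
use warnings;

package My::TaxNode;    # hypothetical stand-in for Bio::Taxonomy::Node

sub new {
    my ( $class, %args ) = @_;
    my $self = {%args};
    return bless $self, $class;
}

# Read-only accessors: there are no setters, so a shared node never changes.
sub object_id { $_[0]{'object_id'} }
sub rank      { $_[0]{'rank'} }
sub name      { $_[0]{'name'} }

# "Modify" a node by building a new one; the original stays untouched.
sub modified_copy {
    my ( $self, %changes ) = @_;
    return ref($self)->new( %$self, %changes );
}

package main;

my $mammalia = My::TaxNode->new(
    object_id => 40674,
    rank      => 'class',
    name      => 'Mammalia',
);

# Two taxonomies share the very same node object.
my %tax_a = ( 'class' => $mammalia );
my %tax_b = ( 'class' => $mammalia );

# Taxonomy B wants a different name, so it swaps in a new node.
$tax_b{'class'} = $mammalia->modified_copy( name => 'mammals' );

print $tax_a{'class'}->name, "\n";    # still the original node
print $tax_b{'class'}->name, "\n";    # the new node
```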

Definitely, we need a Taxonomy factory, like Jason's Bio::DB::Taxonomy
or what we are going to create in bioperl-db. Both Taxonomy objects and
Node objects have a reference to this factory, so that a Taxonomy can be
created automatically and a Node can ask who its parent is (e.g.
$node->get_parent_node, implemented as
$node->_factory->find_parent_node($node)).
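
As a rough sketch of that delegation: the packages My::TaxFactory and
My::TaxNode, the get_node method, and the in-memory rows with made-up ids are
all hypothetical. Only get_parent_node and find_parent_node follow the names
proposed above.

```perl
use strict;
use warnings;

package My::TaxFactory;    # hypothetical; stands in for the real factory

sub new {
    my ( $class, $rows ) = @_;
    my $self = { 'rows' => $rows };
    return bless $self, $class;
}

# Build a node for an id and hand it a reference back to this factory.
sub get_node {
    my ( $self, $id ) = @_;
    my $row = $self->{'rows'}{$id} or return;
    return My::TaxNode->new(
        object_id => $id,
        name      => $row->{'name'},
        _factory  => $self,
    );
}

# Only the factory knows the parent/child relationships.
sub find_parent_node {
    my ( $self, $node ) = @_;
    my $parent_id = $self->{'rows'}{ $node->object_id }{'parent'};
    return defined $parent_id ? $self->get_node($parent_id) : undef;
}

package My::TaxNode;       # hypothetical node; holds no parent reference

sub new {
    my ( $class, %args ) = @_;
    my $self = {%args};
    return bless $self, $class;
}
sub object_id { $_[0]{'object_id'} }
sub name      { $_[0]{'name'} }
sub _factory  { $_[0]{'_factory'} }

# The node asks its factory who its parent is.
sub get_parent_node {
    my ($self) = @_;
    return $self->_factory->find_parent_node($self);
}

package main;

# Made-up ids, for illustration only.
my $factory = My::TaxFactory->new( {
    1 => { 'name' => 'Mus',          'parent' => undef },
    2 => { 'name' => 'Mus musculus', 'parent' => 1 },
} );

my $mouse = $factory->get_node(2);
print $mouse->get_parent_node->name, "\n";
```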

Comments, please, and I will turn the idea into code.

Thanks.

Juguang

------------ATGCCGAGCTTNNNNCT--------------
Juguang Xiao
Bioinformatics Engineer
Temasek Life Sciences Laboratory, National University of Singapore
1 Research Link,  Singapore 117604
fax: (+65) 68727007

juguang at tll.org.sg
