[Bioperl-l] taxonomy and species

Juguang Xiao juguang at tll.org.sg
Thu Aug 28 05:55:01 EDT 2003


Hi guys,

I tried to write a simple bioperl-db script that works like the 
search on http://www.ncbi.nih.gov/Taxonomy/taxonomyhome.html/ and 
returns a full taxonomy path plus all sub-taxonomy nodes. For example, 
if I search for 'mouse', it should return the full path as

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus; mouse

All sub-taxonomy nodes should also be returned, such as 'asian house 
mouse', 'european house mouse', etc.

However, Hilmar, the guru, told me that the current bioperl-db works 
with Bio::Species, not Bio::Taxonomy, so bioperl-db cannot satisfy the 
requirement above until its code is adapted to Taxonomy, after Taxonomy 
replaces Species. So I investigated the species-related modules, found 
some puzzles, and would like to volunteer an idea and the code.

Bio::Taxonomy was written by Dan Kortschak. Its main and only 
functional method (apart from the get/set methods, I mean), 'classify', 
converts a Species object into an array of names. That wastes such a 
nice module name ;-)

Jason wrote Bio::Taxonomy::Node and Bio::DB::Taxonomy, which either 
accesses NCBI Entrez over HTTP or reads the NCBI Tax dump files. 
Bio::Taxonomy::Node is tied closely to Bio::DB::Taxonomy, so it cannot 
easily be adapted to the bioperl-db system.

My plan to reform them is described below.

DATA STRUCTURE
Taxonomy should be abstracted as a hash with the keys as rank names, 
such as 'class', 'genus', and values as the identifiers, such as NCBI 
taxid, scientific name or Taxonomy::Node object.

$taxonomy = {
	# Ordered list of rank names, copied from the current Taxonomy module.
	'_rank' => ['root', 'superkingdom', ..., 'species', 'subspecies', ..., 'no rank'],

	# The keys of this hash are unordered; their order is defined by '_rank'.
	'_hierarchy' => {
		...
		'class' => 40674, # or 'Mammalia', or the Taxonomy::Node
		'genus' => 'Mus',
		'species' => $tax_Node_musculus,
		...
	},
	'_factory' => $factory, # explained later.
};

NOTE: the new taxonomy can represent more than the species level; for 
example, it can represent an object at genus level without a species.
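To illustrate how the ordered '_rank' list and the unordered '_hierarchy' hash would work together, here is a minimal sketch. The helper names lineage and lowest_rank, the shortened rank list, and the sample values are assumptions made for illustration; they are not part of any existing BioPerl module.

```perl
use strict;
use warnings;

# Hypothetical data following the proposed layout; the rank list is shortened.
my $taxonomy = {
    '_rank'      => ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'],
    '_hierarchy' => {
        'superkingdom' => 'Eukaryota',
        'class'        => 'Mammalia',
        'order'        => 'Rodentia',
        'family'       => 'Muridae',
        'genus'        => 'Mus',
        # no 'species' key: this object stops at genus level
    },
};

# Walk '_rank' in order and keep only the ranks that are filled in.
sub lineage {
    my ($tax) = @_;
    my @path;
    for my $rank (@{ $tax->{'_rank'} }) {
        my $value = $tax->{'_hierarchy'}{$rank};
        push @path, $value if defined $value;
    }
    return @path;
}

# The lowest filled-in rank tells us which level the object represents.
sub lowest_rank {
    my ($tax) = @_;
    my @present = grep { defined $tax->{'_hierarchy'}{$_} } @{ $tax->{'_rank'} };
    return $present[-1];
}

print join('; ', lineage($taxonomy)), "\n";
print lowest_rank($taxonomy), "\n";
```

Because the order lives only in '_rank', ranks that are missing from '_hierarchy' (here 'phylum' and 'species') are simply skipped.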

$taxNode_mammalia = {
	# NCBI taxid; called 'object_id' for consistency with Bio::IdentifiableI
	'object_id' => 40674,
	'rank' => 'class',
	'name' => 'Mammalia', # scientific name
	'common_name' => 'mammals', # GenBank common name, as the NCBI site uses the term
	'alias' => { # a hash with name_class as key and variant name as value
		'' => ''
	},
	'_factory' => $factory
};


$taxNode_mouse = {
	'object_id' => 10090,
	'rank' => 'species',
	'names' => { # This is a general solution!!
		'specific' => ['musculus'],
		'common' => ['mouse', 'Mickey'],
		'includes' => ['nude mice']
	}
};	
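As a rough sketch of why the general 'names' hash is convenient, a search term can be matched against every name class with one loop. The helpers names_of and matches_name are hypothetical and exist only for this example.

```perl
use strict;
use warnings;

# Sample node using the general 'names' layout from the proposal.
my $taxNode_mouse = {
    'object_id' => 10090,
    'rank'      => 'species',
    'names'     => {
        'specific' => ['musculus'],
        'common'   => ['mouse', 'Mickey'],
        'includes' => ['nude mice'],
    },
};

# All names stored under one name class (empty list if the class is absent).
sub names_of {
    my ($node, $name_class) = @_;
    return @{ $node->{'names'}{$name_class} || [] };
}

# Return the first name class (in sorted order) containing the query,
# compared case-insensitively; return nothing if no class matches.
sub matches_name {
    my ($node, $query) = @_;
    for my $name_class (sort keys %{ $node->{'names'} }) {
        for my $name (names_of($node, $name_class)) {
            return $name_class if lc($name) eq lc($query);
        }
    }
    return;
}

my $hit = matches_name($taxNode_mouse, 'Mickey');
print defined $hit ? "found in '$hit'\n" : "not found\n";
```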

OBJECTS

Bio::Taxonomy will override all methods in Bio::Species, for the sake 
of backwards compatibility. If the tax object represents a level higher 
than species, the sub 'binomial' returns undef; otherwise it simply 
builds the result by combining the genus and the species. The sub 
'classification' will look like:

foreach my $rank (@ranks) {	# @ranks is the ordered '_rank' list, root first
	unshift @classification, $taxonomy->{'_hierarchy'}{$rank}
		if defined $taxonomy->{'_hierarchy'}{$rank};
}
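A minimal, self-contained sketch of the two backwards-compatible subs described above. It uses plain functions over a hierarchy hash rather than real Bio::Species methods, and the shortened @ranks list and sample data are assumptions for illustration only.

```perl
use strict;
use warnings;

# Shortened, ordered rank list (root-most first); an assumption for this sketch.
my @ranks = ('class', 'order', 'family', 'genus', 'species');

# Classification with the most specific rank first.
sub classification {
    my ($hierarchy) = @_;
    my @classification;
    for my $rank (@ranks) {
        unshift @classification, $hierarchy->{$rank} if defined $hierarchy->{$rank};
    }
    return @classification;
}

# Genus plus species, or undef when the object sits above species level.
sub binomial {
    my ($hierarchy) = @_;
    return unless defined $hierarchy->{'genus'} && defined $hierarchy->{'species'};
    return join(' ', $hierarchy->{'genus'}, $hierarchy->{'species'});
}

my %mouse = (
    'class'   => 'Mammalia',
    'order'   => 'Rodentia',
    'family'  => 'Muridae',
    'genus'   => 'Mus',
    'species' => 'musculus',
);
my %genus_only = ('class' => 'Mammalia', 'genus' => 'Mus');

print join(', ', classification(\%mouse)), "\n";
my $name = binomial(\%genus_only);
print defined $name ? "$name\n" : "no binomial above species level\n";
```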


Bio::Taxonomy::Node has NO reference to either its parent node or a 
Taxonomy object, so that Node objects can be freely shared among 
Taxonomy objects. The tricky part: once a Node object is created, its 
content should not be changed. If a Taxonomy needs one of its Nodes 
modified, it has to make a new Node, in case that Node is shared with 
another Taxonomy.
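One possible way to get this "make a new Node instead of changing it" behaviour is a read-only node class with a copy method. The sketch below is only an illustration: the package name My::TaxNode and the method clone_with are hypothetical, and the common names are sample values.

```perl
use strict;
use warnings;

package My::TaxNode;    # hypothetical stand-in for the proposed Node class

sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}

# Read-only accessors: there are deliberately no setters.
sub object_id   { $_[0]{'object_id'} }
sub rank        { $_[0]{'rank'} }
sub common_name { $_[0]{'common_name'} }

# Return a modified copy and leave the original node untouched.
sub clone_with {
    my ($self, %changes) = @_;
    return ref($self)->new(%$self, %changes);
}

package main;

my $shared = My::TaxNode->new(
    'object_id'   => 10090,
    'rank'        => 'species',
    'common_name' => 'mouse',
);

# Two taxonomies share the same node object.
my $tax_a = { 'species' => $shared };
my $tax_b = { 'species' => $shared };

# Taxonomy A wants a different common name, so it swaps in a new node.
$tax_a->{'species'} = $shared->clone_with('common_name' => 'house mouse');

print $tax_a->{'species'}->common_name, "\n";
print $tax_b->{'species'}->common_name, "\n";
```

Taxonomy B still sees the original node, which is the point of never modifying a shared Node in place.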

We definitely need a Taxonomy factory, like Jason's Bio::DB::Taxonomy 
or what we are going to create in bioperl-db. Both Taxonomy objects and 
Node objects hold a reference to this factory, so that a Taxonomy can be 
created automatically and a Node can ask who its parent is 
($node->get_parent_node, implemented as, e.g., 
$node->_factory->find_parent_node($node)).
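The delegation from Node to factory could be sketched as below, with a tiny in-memory factory standing in for Bio::DB::Taxonomy or a bioperl-db adaptor. Everything here is hypothetical: the package names, add_node, find_child_nodes and lineage are invented for the example; the tree skips most intermediate ranks; and the taxids other than 10090 and 40674 are illustrative and should be checked against NCBI.

```perl
use strict;
use warnings;

package My::TaxFactory;    # hypothetical in-memory factory

sub new {
    my ($class) = @_;
    my $self = { 'nodes' => {}, 'parent_of' => {} };
    return bless $self, $class;
}

# Register a node; the parent link is kept in the factory, not in the node.
# (Node and factory reference each other, which is fine for a short script.)
sub add_node {
    my ($self, $node, $parent_id) = @_;
    $node->{'_factory'} = $self;
    $self->{'nodes'}{ $node->{'object_id'} }     = $node;
    $self->{'parent_of'}{ $node->{'object_id'} } = $parent_id;
    return $node;
}

sub find_parent_node {
    my ($self, $node) = @_;
    my $parent_id = $self->{'parent_of'}{ $node->{'object_id'} };
    return defined $parent_id ? $self->{'nodes'}{$parent_id} : undef;
}

sub find_child_nodes {
    my ($self, $node) = @_;
    my @children;
    for my $id (sort { $a <=> $b } keys %{ $self->{'parent_of'} }) {
        my $parent_id = $self->{'parent_of'}{$id};
        push @children, $self->{'nodes'}{$id}
            if defined $parent_id && $parent_id == $node->{'object_id'};
    }
    return @children;
}

package My::TaxNode;       # hypothetical node class

sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}
sub name     { $_[0]{'name'} }
sub _factory { $_[0]{'_factory'} }

# The node holds no parent reference; it asks the factory.
sub get_parent_node {
    my ($self) = @_;
    return $self->_factory->find_parent_node($self);
}

# Full path from the root-most known node down to this node.
sub lineage {
    my ($self) = @_;
    my @path;
    for (my $n = $self; defined $n; $n = $n->get_parent_node) {
        unshift @path, $n->name;
    }
    return @path;
}

package main;

my $factory  = My::TaxFactory->new;
my $mammalia = $factory->add_node(
    My::TaxNode->new('object_id' => 40674, 'rank' => 'class', 'name' => 'Mammalia'), undef);
my $mus = $factory->add_node(
    My::TaxNode->new('object_id' => 10088, 'rank' => 'genus', 'name' => 'Mus'), 40674);
my $mouse = $factory->add_node(
    My::TaxNode->new('object_id' => 10090, 'rank' => 'species', 'name' => 'Mus musculus'), 10088);
$factory->add_node(
    My::TaxNode->new('object_id' => 10091, 'rank' => 'subspecies', 'name' => 'Mus musculus castaneus'), 10090);
$factory->add_node(
    My::TaxNode->new('object_id' => 10092, 'rank' => 'subspecies', 'name' => 'Mus musculus domesticus'), 10090);

print join('; ', $mouse->lineage), "\n";
print join(', ', map { $_->name } $factory->find_child_nodes($mouse)), "\n";
```

With a lookup like this, the 'mouse' search from the beginning of this mail reduces to one lineage call for the full path and one child lookup for the sub nodes.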

Comments, please, and I will turn the idea into code.


Thanks.

Juguang



------------ATGCCGAGCTTNNNNCT--------------
Juguang Xiao
Bioinformatics Engineer
Temasek Life Sciences Laboratory, National University of Singapore
1 Research Link,  Singapore 117604
fax: (+65) 68727007

juguang at tll.org.sg


