[Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul

Sendu Bala bix at sendu.me.uk
Sat Aug 5 15:42:10 UTC 2006


After the initial round of changes to Taxonomy described at
http://bugzilla.open-bio.org/show_bug.cgi?id=2047 (now committed),
further changes will allow for the transition of Bio::Species to
Bio::Taxonomy::Node (renamed to Bio::Taxon), and for Taxon to be fully
usable without external database access.

In brief: rename Bio::Taxonomy::Node to Bio::Taxon, make Bio::Taxon
implement Bio::Tree::NodeI, make Bio::Species a Bio::Taxon, remove all
Bio::Species-related-backward-compatible methods from Bio::Taxon, create
Bio::DB::Taxonomy::list, update Bio::SeqIO::genbank et al.

The following is the set of changes that have been made (with all
relevant tests passing), but not committed. Feedback is encouraged.
These notes are also available at
http://bugzilla.open-bio.org/show_bug.cgi?id=2061 for easier reference
later.


(In the following notes, the capitalized word 'Taxon' refers to the
module Bio::Taxon or an instance of that class, while 'taxon' refers to
the concept of a taxonomic unit.)


Bio::DB::Taxonomy, ::*
----------------------

# API-CHANGES
get_Taxonomy_Node() renamed get_taxon(). get_Taxonomy_Node() is a
synonym of get_taxon(), eventually to be deprecated.

New methods ancestor() and each_Descendent() correspond to similar
methods in Bio::Taxon and Bio::Tree::NodeI, removing the need to store
parent_id on each Taxon.

New internal method _handle_internal_id(). See Implementation notes below.

# Implementation changes
Normally when you create a Bio::Taxon it automatically receives a new
unique internal id. However, when you request the same taxon from a
database more than once, you always get an object with the same internal
id. This allows get_lca to work, and lets you modify one copy of a
returned object yet still compare it to another copy and see that they
represent the same taxon. This even applies across different databases.
The Taxon objects returned will still have different memory locations.
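
To illustrate the idea only: below is a minimal Python sketch of a
shared id registry. It is not the BioPerl code. The names
handle_internal_id, ToyDB and the choice of the taxon name as the lookup
key are all assumptions made for the example; the notes above do not say
how _handle_internal_id() identifies "the same taxon".

```python
import itertools

# One counter and one registry shared by every toy database (assumption).
_next_id = itertools.count(1)
_shared_ids = {}


def handle_internal_id(key):
    """Return the internal id already issued for key, or issue a new one."""
    if key not in _shared_ids:
        _shared_ids[key] = next(_next_id)
    return _shared_ids[key]


class Taxon:
    def __init__(self, name, internal_id=None):
        self.name = name
        # A Taxon created directly gets a fresh unique internal id.
        if internal_id is None:
            internal_id = next(_next_id)
        self.internal_id = internal_id


class ToyDB:
    """Hypothetical database holding a set of taxon names."""

    def __init__(self, names):
        self.names = set(names)

    def get_taxon(self, name):
        if name not in self.names:
            return None
        # Taxa that come from a database share ids through the registry.
        return Taxon(name, internal_id=handle_internal_id(name))


db1 = ToyDB(["Homo sapiens", "Pan troglodytes"])
db2 = ToyDB(["Homo sapiens"])
a = db1.get_taxon("Homo sapiens")
b = db2.get_taxon("Homo sapiens")
```

Here a and b are distinct objects, but they carry the same internal id,
so they can be compared as the same taxon.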


Bio::DB::Taxonomy::flatfile
---------------------------

# API-CHANGES
get_Children_Taxids is deprecated - method no longer part of the
DB::Taxonomy interface, and superseded by each_Descendent (which is
actually implemented by all databases).

# Implementation changes
No longer includes the fake root node 'root'; there are multiple roots
now (10239, 12884, 12908, 29384 and 131567). This means when getting the
lineage you no longer have to remove the root node. This is now
consistent with the results possible with entrez.
NB: You have to delete your current indexes before you will notice the
change.


Bio::DB::Taxonomy::entrez
-------------------------

# API-CHANGES
get_node has new option -full that tells it to retrieve full details on
a taxon from the website. (Otherwise, it may return a Taxon with minimal
information if only minimal information had previously been cached.)

# Implementation changes
Caches the data it gets from the website and tries to minimise the
number of website accesses it does.


Bio::DB::Taxonomy::list
-----------------------

# NEW
An implementation of Bio::DB::Taxonomy that accepts lists of words to
build a database. Used especially by Bio::Species for backward
compatibility purposes, but also useful generally to quickly and easily
create a lineage of Bio::Taxon objects or a Tree.


Bio::Tree::TreeI
----------------

# BUG-FIXES
number_nodes() returned the number of descendants belonging to the root
node, but forgot to count the root node itself. Now number_nodes() ==
scalar(get_nodes()).
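
The off-by-one can be pictured with a small Python sketch. This is a toy
model, not the Bio::Tree::TreeI code; the Node class and the function
names are made up for the example.

```python
class Node:
    """Toy tree node (hypothetical, not Bio::Tree::Node)."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return child


def get_nodes(root):
    """Return the root followed by all of its descendants (pre-order)."""
    nodes = [root]
    for child in root.children:
        nodes.extend(get_nodes(child))
    return nodes


def number_nodes_old(root):
    """Old behaviour: counts the root's descendants, forgets the root."""
    return len(get_nodes(root)) - 1


def number_nodes(root):
    """Fixed behaviour: the same count as get_nodes()."""
    return len(get_nodes(root))


root = Node("Hominidae")
group = root.add_child(Node("Homo/Pan/Gorilla group"))
root.add_child(Node("Pongo pygmaeus"))
group.add_child(Node("Homo sapiens"))
```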


Bio::Tree::Tree
---------------

# API-CHANGES
Added -node option to new() which will call get_lineage_nodes() on the
supplied NodeI and set the tree root that way. This is so you can easily
make a tree from a Bio::Taxon. To stop the Tree resulting from a
Bio::Taxon with a db_handle from pulling in the entire database,
ancestor() / add_Descendent() is set for each member of the lineage
while finding the root from the -node. The database is therefore no
longer asked for the ancestor or descendents of the taxa.


Bio::Tree::TreeFunctionsI
-------------------------

# API-CHANGES
New method get_lineage_nodes(). Returns all the ancestors of a
particular node, up to the tree's defined root node.
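
A rough Python sketch of the walk, for illustration. It is a toy model,
not the TreeFunctionsI code; returning the lineage root-first and
returning an empty list for the root itself are assumptions of this
example.

```python
class Node:
    """Toy node that only knows its ancestor (hypothetical)."""

    def __init__(self, name, ancestor=None):
        self.name = name
        self.ancestor = ancestor


def get_lineage_nodes(node, root):
    """Return the ancestors of node, root-first, stopping at root."""
    if node is root:
        return []
    lineage = []
    current = node.ancestor
    while current is not None:
        lineage.append(current)
        if current is root:
            # Do not climb above the tree's defined root.
            break
        current = current.ancestor
    lineage.reverse()
    return lineage


cellular = Node("cellular organisms")
hominidae = Node("Hominidae", cellular)
group = Node("Homo/Pan/Gorilla group", hominidae)
human = Node("Homo sapiens", group)

names = [n.name for n in get_lineage_nodes(human, hominidae)]
```

With hominidae as the defined root, the walk stops there and
'cellular organisms' is not included in names.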

get_lca() can now also accept just a list of nodes, and also more than 2
nodes.

Removed _check_two_nodes() since no longer necessary.

New method splice(). Removes the requested nodes from a tree,
re-attaching each removed node's descendants to the removed node's
ancestor (ie. removes nodes without making the tree fall apart).
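
As a sketch of the re-attachment step, in Python and on a toy Node class
rather than the real Bio::Tree classes. Refusing to splice the root is a
simplification of this example, not a statement about the real method.

```python
class Node:
    """Toy tree node (hypothetical, not Bio::Tree::Node)."""

    def __init__(self, name):
        self.name = name
        self.ancestor = None
        self.children = []

    def add_child(self, child):
        child.ancestor = self
        self.children.append(child)
        return child


def splice(node):
    """Remove node, handing its children over to its ancestor."""
    parent = node.ancestor
    if parent is None:
        raise ValueError("this sketch does not splice out the root")
    parent.children.remove(node)
    for child in list(node.children):
        parent.add_child(child)  # also resets child.ancestor
    node.children = []
    node.ancestor = None


root = Node("Hominidae")
group = root.add_child(Node("Homo/Pan/Gorilla group"))
pongo = root.add_child(Node("Pongo pygmaeus"))
human = group.add_child(Node("Homo sapiens"))
pan = group.add_child(Node("Pan troglodytes"))

splice(group)
```

After the call, human and pan hang directly off root and the tree is
still connected.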

New method contract_linear_paths(). Splices out all nodes in the tree
that have an ancestor and only one descendant.
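
One way to picture this, as a Python sketch on a toy Node class (not the
BioPerl implementation). The candidates are collected before any
splicing; that is safe here because splicing a single-child node leaves
its ancestor's child count unchanged.

```python
class Node:
    """Toy tree node (hypothetical, not Bio::Tree::Node)."""

    def __init__(self, name):
        self.name = name
        self.ancestor = None
        self.children = []

    def add_child(self, child):
        child.ancestor = self
        self.children.append(child)
        return child


def all_nodes(root):
    nodes = [root]
    for child in root.children:
        nodes.extend(all_nodes(child))
    return nodes


def splice(node):
    """Remove node, handing its children over to its ancestor."""
    parent = node.ancestor
    parent.children.remove(node)
    for child in list(node.children):
        parent.add_child(child)
    node.children = []
    node.ancestor = None


def contract_linear_paths(root):
    """Splice out every node with an ancestor and exactly one child."""
    linear = [n for n in all_nodes(root)
              if n.ancestor is not None and len(n.children) == 1]
    for node in linear:
        splice(node)


root = Node("root")
a = root.add_child(Node("A"))   # linear: one child
b = a.add_child(Node("B"))      # branch point: two children
c = b.add_child(Node("C"))
d = b.add_child(Node("D"))

contract_linear_paths(root)
```

The root keeps its place even though it has a single descendant, because
it has no ancestor.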

New method merge_lineage(). Merges a lineage of nodes with an existing Tree.
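
The notes do not spell out how the merge works, so the following Python
sketch is only one plausible reading: walk the lineage from the root
down, reuse nodes that already exist, and graft on whatever is missing.
Matching nodes by name, and passing the lineage as a list of names, are
assumptions of the example.

```python
class Node:
    """Toy tree node (hypothetical, not Bio::Tree::Node)."""

    def __init__(self, name):
        self.name = name
        self.ancestor = None
        self.children = []

    def add_child(self, child):
        child.ancestor = self
        self.children.append(child)
        return child


def merge_lineage(root, lineage_names):
    """Merge a root-first lineage into the tree; return the final node."""
    if not lineage_names or lineage_names[0] != root.name:
        raise ValueError("lineage must start at the tree's root")
    current = root
    for name in lineage_names[1:]:
        match = next((c for c in current.children if c.name == name), None)
        if match is None:
            # Not in the tree yet: graft a new node at this point.
            match = current.add_child(Node(name))
        current = match
    return current


root = Node("Hominidae")
group = root.add_child(Node("Homo/Pan/Gorilla group"))
group.add_child(Node("Gorilla"))

leaf = merge_lineage(
    root, ["Hominidae", "Homo/Pan/Gorilla group", "Homo sapiens"])
```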

# Implementation changes
get_lca() now uses get_lineage_nodes() and gives the correct answer; the
previous implementation was not guaranteed to. It can also find the lca
of more than 2 nodes.
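
For illustration, a lineage-based lca can be sketched in Python like
this. It is a toy model of the idea (compare root-first paths and keep
the last shared node), not a copy of the BioPerl code; including each
node in its own path is a choice of this sketch.

```python
class Node:
    """Toy node that only knows its ancestor (hypothetical)."""

    def __init__(self, name, ancestor=None):
        self.name = name
        self.ancestor = ancestor


def path_from_root(node):
    """Return the root-first path ending at node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = node.ancestor
    path.reverse()
    return path


def get_lca(*nodes):
    """Return the deepest node shared by the paths of all given nodes."""
    if not nodes:
        return None
    paths = [path_from_root(n) for n in nodes]
    lca = None
    for level in zip(*paths):  # stops at the shortest path
        first = level[0]
        if all(n is first for n in level):
            lca = first
        else:
            break
    return lca


hominidae = Node("Hominidae")
pongo = Node("Pongo pygmaeus", hominidae)
group = Node("Homo/Pan/Gorilla group", hominidae)
gorilla = Node("Gorilla", group)
pan = Node("Pan troglodytes", group)
human = Node("Homo sapiens", group)
```

The same function handles two nodes or many, and returns None when the
nodes share no ancestor at all.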

reroot() uses get_lineage_nodes().

Methods distance(), is_monophyletic() and is_paraphyletic()
reimplemented with the new get_lca().

find_node() no longer warns about an unknown search type (allowing you
to search on -rank and any other thing in the future).


Bio::Tools::Phylo::PAML
-----------------------

# Implementation changes
Methods that make use of get_lca() reimplemented with the new get_lca().
(otherwise, PAML tests no longer passed)


Bio::Tree::Node
---------------

# Implementation changes
When you change a node's ancestor, ancestor() now correctly removes the
node from the previous ancestor's descendants and adds it to the new
ancestor's descendants.
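
The bookkeeping can be sketched in Python as follows. This is a toy
class written for the example, not Bio::Tree::Node; it only shows the
two-sided update described above.

```python
class Node:
    """Toy node whose ancestor setter keeps both sides in sync."""

    def __init__(self, name):
        self.name = name
        self._ancestor = None
        self.children = []

    @property
    def ancestor(self):
        return self._ancestor

    @ancestor.setter
    def ancestor(self, new_ancestor):
        old_ancestor = self._ancestor
        if old_ancestor is new_ancestor:
            return
        if old_ancestor is not None:
            # Detach from the previous ancestor's descendants.
            old_ancestor.children.remove(self)
        if new_ancestor is not None:
            # Attach to the new ancestor's descendants.
            new_ancestor.children.append(self)
        self._ancestor = new_ancestor


old_parent = Node("old")
new_parent = Node("new")
child = Node("child")

child.ancestor = old_parent
child.ancestor = new_parent
```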


t/Node.t
--------
Added tests for setting ancestor()


Bio::Taxonomy::Node
-------------------

# DEPRECATED (name change)
isa Bio::Taxon

# Implementation changes
No code; delegates to Bio::Taxon


Bio::Taxon
----------

# NEW (name change from Bio::Taxonomy::Node)
Changes below relate to changes to Bio::Taxonomy::Node

# API-CHANGES
Removed the following options from new(): -classification,
-sub_species, -variant and -organelle. The corresponding methods are no
longer present.

New option to new(): -id. For Tree::Node compatibility. -object_id and
-ncbi_taxid are no longer mentioned in docs but still work.

The -dbh option to new() no longer defaults to any database. A
Bio::Taxon is now fully usable without ever setting a database handle.

Removed the methods binomial(), species(), genus(), sub_species(),
variant(), classification() and show_all(). Not appropriate to have
rank-specific methods in a class that models any single rank. Definitely
not appropriate to store information about other taxa in a Taxon.
These questions can be answered using Tree* methods, or with
Bio::Species.

Removed method organelle(). Organelle isn't part of a taxonomy. Other
modules like SeqIO should have their own storage of organelle
information as necessary (but Bio::Species retains organelle() in the
meantime).

Removed methods get_Lineage_Nodes() and get_LCA_Node(). For these kinds
of methods you should now use Bio::Tree::TreeFunctionsI methods.

You can no longer set parent_id(). The id of your parent is determined
by the Taxon that is your ancestor. This method is no longer needed
(previously it was central to the workings of the object), so is now
deprecated. It issues a warning if you try and set its value.

get_Parent_Node(), eventually to be deprecated, is now a synonym of the
new method ancestor(). (For Tree::Node compatibility.)

get_Children_Nodes(), eventually to be deprecated, is now a synonym of
the new method each_Descendent(). (For Tree::Node compatibility.)

object_id(), eventually to be deprecated, is now a synonym of the new
method id(). (For Tree::Node compatibility.)

# Implementation changes
is(also)a Bio::Tree::Node.

division() was implemented via $self->name('division', @_). Now
name('division') will only allow one value to be set, and division()
only ever returns a single scalar or undef, never an array.

common_names() returns the last common_name in scalar context (instead
of first), so set/get/set/get works as expected with common_name().

db_handle() similar to before when getting, but now setting the handle
will locate $self in the new database (by id or name) and merge data
(eg. if rank was 'no rank' and new database node has rank 'species',
$self->rank() will become 'species').

get_Parent_Node() (now ancestor()) and get_Children_Nodes() (now
each_Descendent()) now use the Bio::Tree::Node implementation.
ancestor() falls back to asking the database for the ancestor if one has
not been manually set by the user. each_Descendent() does NOT fall back
to the database, preventing the whole database being pulled into a Tree
object made with a Bio::Taxon.

parent_id() now gets the ancestor Taxon with ancestor() and returns
$ancestor->id().
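
The asymmetry between ancestor() and each_Descendent(), and the derived
parent_id(), can be pictured with a Python sketch. Everything here is a
toy: ToyDB, set_ancestor() and the integer ids are invented for the
example and are not BioPerl names.

```python
class ToyDB:
    """Hypothetical taxonomy database: maps an id to its parent's id."""

    def __init__(self, parent_of):
        self.parent_of = parent_of

    def ancestor(self, taxon):
        parent_id = self.parent_of.get(taxon.id)
        if parent_id is None:
            return None
        return Taxon(parent_id, db_handle=self)


class Taxon:
    def __init__(self, id, db_handle=None):
        self.id = id
        self.db_handle = db_handle
        self._ancestor = None
        self._descendents = []

    def set_ancestor(self, taxon):
        self._ancestor = taxon
        taxon._descendents.append(self)

    def ancestor(self):
        # A manually set ancestor wins; otherwise ask the database.
        if self._ancestor is not None:
            return self._ancestor
        if self.db_handle is not None:
            return self.db_handle.ancestor(self)
        return None

    def each_descendent(self):
        # Deliberately no database fallback here.
        return list(self._descendents)

    def parent_id(self):
        ancestor = self.ancestor()
        return None if ancestor is None else ancestor.id


db = ToyDB({3: 2, 2: 1})   # toy data: 3's parent is 2, 2's parent is 1
species = Taxon(3, db_handle=db)
genus = species.ancestor()
```

Note that genus came from the database, yet genus.each_descendent() is
empty: descendants only appear once they have been set explicitly.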

Had to remove the clean up methods from Bio::Tree::Node since they were
in a CODE ref, preventing Bio::Species objects from being frozen with
Storable. Will come up with a better solution in the future.


Bio::Taxonomy
-------------

# DEPRECATED
Redundant


Bio::Taxonomy::Taxon
--------------------

# DEPRECATED
Redundant


Bio::Taxonomy::Tree
-------------------

# DEPRECATED
Redundant


Bio::Taxonomy::FactoryI
-----------------------

# DEPRECATED
Redundant


Bio::Species
------------

# Implementation changes
Bio::Species isa Bio::Taxon.

No method uses validate_species_name() any more. (but the method remains
unaltered, as does validate_name() which just returns 1 - no change).

classification() set implemented as:
Set db_handle() to a new Bio::DB::Taxonomy::list with the supplied
classification array and make a Bio::Tree::Tree of self, stored in self.
Getting the classification implemented as:
Return the scientific_name() of each Taxon returned by our
tree->get_lineage_nodes.

Methods ncbi_taxid(), division() and common_name() implemented by Taxon.

Methods species(), genus(), subspecies() and variant() no longer get/set
elements in the classification array or store direct values. They are
implemented like:
Ask our tree for the taxon with rank() eq method name and set/get
the scientific_name of that.
Otherwise, for methods species() and genus() assume we are rank()
'species', our parent taxon is rank() 'genus' and try again. For
subspecies() and variant(), fall back to old implementation (store data
directly on self).
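
The get side of that lookup, sketched in Python for species() and
genus(). The lineage is represented as a plain list of dicts, root-first
with self last; that representation and the function name are
assumptions of the example, and the subspecies()/variant() fallback to
stored data is left out.

```python
def name_for_rank(lineage, rank):
    """Return the scientific name for rank, or None if it is unknown.

    lineage is root-first and its last element stands for self.
    """
    # First choice: a taxon in the tree that carries the requested rank.
    for taxon in lineage:
        if taxon["rank"] == rank:
            return taxon["scientific_name"]
    # Otherwise assume self is the species and its parent is the genus.
    if rank == "species" and lineage:
        return lineage[-1]["scientific_name"]
    if rank == "genus" and len(lineage) >= 2:
        return lineage[-2]["scientific_name"]
    return None


ranked = [
    {"rank": "genus", "scientific_name": "Homo"},
    {"rank": "species", "scientific_name": "Homo sapiens"},
]
unranked = [
    {"rank": "no rank", "scientific_name": "Mus"},
    {"rank": "no rank", "scientific_name": "Mus musculus"},
]
```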

binomial() prefers to simply return scientific_name() if we are a Taxon
with rank() 'species' and the scientific_name is at least a 2 word
scalar. It interprets the 'FULL' option as wanting the trinomial name
and prefers to simply return scientific_name() if we have rank()
'subspecies' or 'variant' and at least 3 word scalar. Failing these two
cases, it falls back on the old implementation (build 'genus species'
from the classification), but with a little more intelligence to try and
not duplicate names.
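
A Python sketch of that decision order, as I read the description above.
The function signature is invented for the example, and the particular
duplicate check (do not prepend the genus when the species name already
starts with it) is only a guess at what "a little more intelligence"
could look like.

```python
def binomial(scientific_name, rank, genus, species,
             subspecies=None, full=False):
    """Return a binomial name, or a trinomial one when full is true."""
    words = scientific_name.split()
    if full:
        # FULL: a subspecies/variant with 3+ words is already a trinomial.
        if rank in ("subspecies", "variant") and len(words) >= 3:
            return scientific_name
    elif rank == "species" and len(words) >= 2:
        # A species with 2+ words is already a binomial.
        return scientific_name
    # Fallback: build 'genus species' from the classification, without
    # repeating the genus if the species name already begins with it.
    if species.split()[:1] == [genus]:
        name = species
    else:
        name = f"{genus} {species}"
    if full and subspecies:
        name = f"{name} {subspecies}"
    return name
```

For example, binomial("Mus", "no rank", "Mus", "Mus musculus") falls
through to the classification and yields 'Mus musculus' rather than
'Mus Mus musculus'.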

# Behaviour changes
An indirect new behaviour is that the SeqIO modules will probably return
->species() as the real species name (eg. 'Homo sapiens'), not the
previously (and sometimes incorrectly) munged name (eg. 'sapiens').

# Notes
Stores a Bio::Tree::Tree on itself, had to remove its clean up methods
since they were in a CODE ref, preventing us from being frozen with
Storable. Will come up with a better solution in the future.


Bio::SeqIO::*
-------------
A number of these modules make use of Bio::Species when parsing
taxonomic information. They probably all have/had problems. I've only
investigated genbank to any significant depth; the others need to be
properly tested to see whether, after reading taxonomic data in, they
can write it out again identically to the input file. Some probably fail
at this currently. (I simply don't have time myself to make all these
modules perfect.)


Bio::SeqIO::bsml_sax
--------------------

# BUG-FIXES
It used to include the genus twice in the classification array of the
Bio::Species object. Now it doesn't.


Bio::SeqIO::embl
----------------

# BUG-FIXES
When the OC lines include the species name, the Bio::Species
classification array included the true species name as a rank above
genus and the real genus duplicated as a rank above that. Now it doesn't.


Bio::SeqIO::genbank
-------------------

# BUG-FIXES
Now that Bio::Species isa Bio::Taxon, it is possible to ensure that a
record written back out matches the record that was read in (in the
SOURCE and ORGANISM lines at least). Usage of Bio::Species
re-implemented to get all tests in t/genbank.t to pass.


t/genbank.t
-----------
Modified some tests to expect the correct answer, ie.
$bio_species_obj->species now expects 'Mus musculus', not 'musculus'.


t/Index.t
---------
Modified some tests to expect the correct answer, ie.
$bio_species_obj->species now expects 'Homo sapiens', not 'sapiens'.


scripts/taxa/taxonomy2tree.PLS
------------------------------
Added some extra options to define the location of the database indexes
and files, or use the entrez on-line database instead. (Note how entrez
and flatfile are now truly interchangeable.)

Reimplemented using the new Bio::Taxon system. Now much simpler. You
also get the correct answer, eg. instead of
(("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo sapiens")"Homo/Pan/Gorilla group")Hominidae)root;
you now get
(("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo sapiens")"Homo/Pan/Gorilla group")Hominidae)"cellular organisms";



