[Bioperl-l] Bio::*Taxonomy* changes

Chris Fields cjfields at uiuc.edu
Tue Jul 18 19:34:29 UTC 2006


...
> [regarding changes to Bio::Taxonomy::Node]
> 
> Actually, I'm really strongly leaning toward getting rid of the
> following methods and new() options (and giving up entirely on being
> able to keep 'sapiens' somewhere):
> 
> -organelle, organelle()
> -division, division()
> -sub_species, sub_species()
> -variant, variant()
> species(), validate_species_name()
> genus()
> binomial()
> 
> As far as I can see none of these methods have any place in a generic
> Node class. If you want to know what your species is you have to be
> rank() 'species' and you just call scientific_name(). The above kind of
> methods belong in something like Bio::Species or similar, NOT in Node.
> Does anyone disagree? Can anyone offer a justification for keeping these
> methods?

Bio::Species and Bio::Taxonomy::Node are closely linked, and the plan is to
have Bio::Species delegate methods to Bio::Taxonomy::Node.  So any changes
to Node will affect Bio::Species to some degree.

If you can get the lineage from XML, you could set many of these based on
the rank given.  Jason uses XML::Twig in Bio::DB::Taxonomy::entrez to parse
out the XML data into Bio::Taxonomy::Node objects; it shouldn't be difficult
to leave some methods based on rank (genus, species, etc) as simple get/set
methods for the time being and leave the heavy lifting to the modules
dealing directly with the data.  
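
A minimal sketch of the rank-based idea, in plain Perl and without touching
the real Bio::Taxonomy::Node API.  The lineage structure, the sample data and
the names_by_rank() helper are all hypothetical, standing in for whatever the
parser hands over:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parsed lineage: one hash per taxon, ordered
# from the root toward the leaf.  The data is illustrative only.
my @lineage = (
    { rank => 'family',  name => 'Hominidae' },
    { rank => 'no rank', name => 'Homo/Pan/Gorilla group' },
    { rank => 'genus',   name => 'Homo' },
    { rank => 'species', name => 'Homo sapiens' },
);

# Collect names keyed by rank.  Ranks that never appear in the lineage
# keep their initial value of undef.
sub names_by_rank {
    my ($lineage, @wanted) = @_;
    my %by_rank = map { $_ => undef } @wanted;
    for my $taxon (@$lineage) {
        my $rank = $taxon->{rank};
        $by_rank{$rank} = $taxon->{name} if exists $by_rank{$rank};
    }
    return \%by_rank;
}

my $ranks = names_by_rank(\@lineage, qw(genus species subspecies));

for my $rank (qw(genus species subspecies)) {
    my $name = defined $ranks->{$rank} ? $ranks->{$rank} : 'undef';
    print "$rank: $name\n";
}
```

Simple get/set methods on the node could then be filled from such a hash,
with the parsing module doing the actual work.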

Bio::Species could then delegate data/methods over to Bio::Taxonomy::Node
fairly easily.  If there is no genus/species data to be grabbed (either it
doesn't exist or isn't present for some reason), then simply leave it as
undef.  

That's also why I thought binomial() could stick around; if you have both
genus() and species() you could combine them using binomial(), building in
special cases or error handling in case genus() or species() or both return
undef.  I don't see the problem in keeping this as long as users know what
it means, which we can ensure by detailing the method in POD.  If someone
complains we tell them to RTFM.
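
Roughly what that could look like, as a standalone sketch rather than the
actual Node or Species code.  The binomial() function here is hypothetical
and takes the two values as arguments; whether species() would hold the bare
epithet ('sapiens') or the full name ('Homo sapiens') is an open question in
this thread, so the sketch assumes either may turn up:

```perl
use strict;
use warnings;

# Hypothetical helper, not the Bio::Taxonomy::Node implementation.
# Returns undef (in scalar context) when either part is missing.
sub binomial {
    my ($genus, $species) = @_;
    return unless defined $genus && defined $species;

    # If species already holds the full name ('Homo sapiens'), do not
    # prepend the genus a second time.
    return $species if index($species, "$genus ") == 0;

    return "$genus $species";
}

my $from_epithet = binomial('Homo', 'sapiens');
my $from_full    = binomial('Homo', 'Homo sapiens');
my $missing      = binomial(undef,  'sapiens');

print "from epithet: $from_epithet\n";
print "from full:    $from_full\n";
print "missing:      ", (defined $missing ? $missing : 'undef'), "\n";
```

The undef case is the one that would need spelling out in POD.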

> Changes I haven't yet discussed but have already made (but not committed):
> 
> *parent_taxon_id = \&parent_id;
> *common_name = \&common_names;
> -factory and factory() removed, since there is no
> Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use
> of a factory once set, and a factory seems redundant when we're a node
> with a -dbh.
> validate_name() removed because it just returns 1.
> 
...
> Actually, I've gone with node_name as the 'pure' and best method to set
> the name of your node with, and made scientific_name an alias of it
> (though it behaves as suggested earlier in the thread).

I don't have any problem with that.  As long as it conforms somewhat to the
NCBI definition to prevent confusion I think it's okay.

> >> What should I do with the classification array? Should it hold the raw
> >> ScientificName like:
> >> join(',', $node->classification) eq 'Homo sapiens, Homo,
> >> Homo/Pan/Gorilla group [...]'?
> 
> (I've decided to do it the above way for consistency with scientific_name)

I think that's fine.

...
 
> Currently, flatfile and entrez ignore nodes with a rank of 'no rank'
> when they build the classification array. I had no intention of changing
> this behaviour.

If you ignore nodes with 'no rank' there will be major problems when
retrieving certain TaxIDs from protein/nucleotide sequences.  I had posted
some sample XML from many NCBI TaxIDs taken from sequence files and via
ELink and a good many of those nodes (most of them from genome projects)
have 'no rank'.  

  <TaxId>376686</TaxId>
  <ScientificName>Flavobacterium johnsoniae UW101</ScientificName>
  ...
  <ParentTaxId>986</ParentTaxId>
  <Rank>no rank</Rank>

...

  <TaxId>373903</TaxId>
  <ScientificName>Halothermothrix orenii H 168</ScientificName>
  ...
  <ParentTaxId>31909</ParentTaxId>
  <Rank>no rank</Rank>


These aren't 'edge cases' anymore; they are now pretty common thanks to
genome sequencing.  I would just assign 'no rank' to rank() and have the
node retained for DB purposes.
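
To make the difference concrete, here is a small sketch in plain Perl.  The
node hashes are hypothetical stand-ins for Node objects; the leaf name and
rank come from the first XML record above, while the ancestor entries are
illustrative assumptions:

```perl
use strict;
use warnings;

# Hypothetical lineage, leaf first.  Only the leaf is taken from the XML
# above; the ancestors are assumed for illustration.
my @nodes = (
    { name => 'Flavobacterium johnsoniae UW101', rank => 'no rank' },
    { name => 'Flavobacterium johnsoniae',       rank => 'species' },
    { name => 'Flavobacterium',                  rank => 'genus'   },
);

# Skipping unranked nodes: the strain the sequence actually belongs to
# drops out of the classification.
my @skipping = map { $_->{name} } grep { $_->{rank} ne 'no rank' } @nodes;

# Keeping every node: the rank is simply recorded as 'no rank'.
my @keeping = map { $_->{name} } @nodes;

print "skipping: ", join(', ', @skipping), "\n";
print "keeping:  ", join(', ', @keeping),  "\n";
```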

It seems that the tax dump loses quite a bit of information somewhere along
the way that shows up in the XML.  Or am I wrong?

> >       <TaxId>1760</TaxId>
> >       <ScientificName>Actinobacteria (class)</ScientificName>
> >       <Rank>class</Rank>
> 
> Ugh. I guess my proposal to remove <> bits via flatfile extends to
> removing () bits via entrez. We don't need unique names; we can use
> object_id() when uniqueness matters.

The XML parsing in Taxonomy::entrez takes care of the <tags> and retains
the character data in between.  It would be a matter of setting the parser
correctly to grab the relevant data and assign it properly.

> >> I don't think binomial() would serve any useful purpose now, however.
> >
> > We could use binomial() for the 'scientific name' as the rest of the
> > world knows it (as in binomial nomenclature), having it built from
> > genus-species like you had originally suggested.
> 
> No, see above. I don't think it makes the slightest bit of sense for a
> Node to go around trying to build things from a parent it may or may not
> have. Again, binomial() is a method for something like Bio::Species, not
> a generic Node class.

Bio::Species, from what I gather, was initially created to hold the tax data
from GenBank/EMBL/SwissProt (RichSeq) files and is not DB-aware.
Bio::Taxonomy::Node was supposed to be like Bio::Species and also be
DB-aware:

http://thread.gmane.org/gmane.comp.lang.perl.bio.general/4284/focus=4321

Again, Bio::Species methods are supposed to (eventually) delegate to
Bio::Taxonomy::Node, so the two are closely linked along with their methods.


Whichever way we go about it here (keeping certain methods and tossing
others, changing the data returned, etc), it looks like there will be API
issues down the road which will directly affect anyone using tax data.  That
affects bioperl-db directly as well as any other bioperl-based DBs which
rely on tax data.  So we need to tread a bit carefully when making major
changes to make sure that they work for bioperl-db and anywhere else that
may require it.

Chris



