[Bioperl-l] Bio::Taxonomy confusion

Thu May 11 03:36:39 UTC 2006

I think you can get pretty much everything from eutils now, though I can
definitely see the use of a local database.  I ran a few tests, really
unrelated to this, using the powerscripting test page at NCBI for eutils
(for the curious, at http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/eu.cgi)
and was able to retrieve XML-formatted taxonomic information.  Here's the
TaxID record for the bacterium Frankia sp. CcI3; it looks like they have
the whole lineage set up by rank.  It gives quite a bit of information.

<?xml version="1.0"?>
<!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN"
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
<TaxaSet>

<Taxon>
  <TaxId>106370</TaxId>
  <ScientificName>Frankia sp. CcI3</ScientificName>
  <ParentTaxId>1854</ParentTaxId>
  <Rank>species</Rank>
  <Division>Bacteria</Division>
  <GeneticCode>
    <GCId>11</GCId>
    <GCName>Bacterial and Plant Plastid</GCName>
  </GeneticCode>
  <MitoGeneticCode>
    <MGCId>0</MGCId>
    <MGCName>Unspecified</MGCName>
  </MitoGeneticCode>
  <Lineage>cellular organisms; Bacteria; Actinobacteria; Actinobacteria
(class); Actinobacteridae; Actinomycetales; Frankineae; Frankiaceae;
Frankia</Lineage>
  <LineageEx>
    <Taxon>
      <TaxId>131567</TaxId>
      <ScientificName>cellular organisms</ScientificName>
      <Rank>no rank</Rank>
    </Taxon>
    <Taxon>
      <TaxId>2</TaxId>
      <ScientificName>Bacteria</ScientificName>
      <Rank>superkingdom</Rank>
    </Taxon>
    <Taxon>
      <TaxId>201174</TaxId>
      <ScientificName>Actinobacteria</ScientificName>
      <Rank>phylum</Rank>
    </Taxon>
    <Taxon>
      <TaxId>1760</TaxId>
      <ScientificName>Actinobacteria (class)</ScientificName>
      <Rank>class</Rank>
    </Taxon>
    <Taxon>
      <TaxId>85003</TaxId>
      <ScientificName>Actinobacteridae</ScientificName>
      <Rank>subclass</Rank>
    </Taxon>
    <Taxon>
      <TaxId>2037</TaxId>
      <ScientificName>Actinomycetales</ScientificName>
      <Rank>order</Rank>
    </Taxon>
    <Taxon>
      <TaxId>85013</TaxId>
      <ScientificName>Frankineae</ScientificName>
      <Rank>suborder</Rank>
    </Taxon>
    <Taxon>
      <TaxId>74712</TaxId>
      <ScientificName>Frankiaceae</ScientificName>
      <Rank>family</Rank>
    </Taxon>
    <Taxon>
      <TaxId>1854</TaxId>
      <ScientificName>Frankia</ScientificName>
      <Rank>genus</Rank>
    </Taxon>
  </LineageEx>
  <CreateDate>1999/10/22</CreateDate>
  <UpdateDate>2005/01/19</UpdateDate>
  <PubDate>2000/02/02</PubDate>
</Taxon>

</TaxaSet>
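
As a rough illustration of what that per-rank layout allows, here is a
minimal sketch that pulls the LineageEx entries out of a record like the
one above and looks up the name at a given rank.  It is in Python rather
than BioPerl purely for brevity, the XML is an abbreviated copy of the
record above, and the helper names parse_lineage and name_at_rank are made
up for this example; they are not part of any library.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the efetch record above (only three lineage levels kept).
TAXON_XML = """
<TaxaSet>
<Taxon>
  <TaxId>106370</TaxId>
  <ScientificName>Frankia sp. CcI3</ScientificName>
  <ParentTaxId>1854</ParentTaxId>
  <Rank>species</Rank>
  <LineageEx>
    <Taxon>
      <TaxId>2</TaxId>
      <ScientificName>Bacteria</ScientificName>
      <Rank>superkingdom</Rank>
    </Taxon>
    <Taxon>
      <TaxId>201174</TaxId>
      <ScientificName>Actinobacteria</ScientificName>
      <Rank>phylum</Rank>
    </Taxon>
    <Taxon>
      <TaxId>1854</TaxId>
      <ScientificName>Frankia</ScientificName>
      <Rank>genus</Rank>
    </Taxon>
  </LineageEx>
</Taxon>
</TaxaSet>
"""


def _node(elem):
    """Turn one <Taxon> element into a small dict (hypothetical layout)."""
    return {
        "taxid": int(elem.findtext("TaxId")),
        "name": elem.findtext("ScientificName"),
        "rank": elem.findtext("Rank"),
    }


def parse_lineage(xml_text):
    """Return (leaf, ancestors); ancestors are ordered root-first, as in the XML."""
    root = ET.fromstring(xml_text)
    taxon = root.find("Taxon")  # the top-level record
    ancestors = [_node(t) for t in taxon.findall("LineageEx/Taxon")]
    return _node(taxon), ancestors


def name_at_rank(ancestors, rank):
    """Look up an ancestor by rank name, not by position in the list."""
    for node in ancestors:
        if node["rank"] == rank:
            return node["name"]
    return None  # this lineage has no node with that rank


leaf, ancestors = parse_lineage(TAXON_XML)
print(leaf["name"], "->", name_at_rank(ancestors, "superkingdom"))
```

Because ranks are looked up by name rather than by position, a lineage
with more or fewer levels should not break the lookup; a rank that is
absent simply comes back as None.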

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Jason Stajich
> Sent: Wednesday, May 10, 2006 7:54 PM
> To: Sendu Bala
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
> 
> I would use the implementation that talks to the flatfile db as the
> standard here.  Nodes are defined by the data from the taxonomy dump
> dbs from NCBI.
> eutils is pretty worthless except for taxid->name or the reverse; you
> can't get the full taxonomy (or couldn't when that implementation was
> written).
> 
> The "name" method refers to the name of the node - each level in the
> taxonomy can have a "name".
> 
> The bits of hackiness relate to wrapping the node object as a
> Bio::Species and/or being able to read a genbank file, with the
> organism taxonomy data as a list, and instantiating from that.  If we
> could rely on everything being in a DB of course this would be simpler.
> 
> Another problem is that the depth of the taxonomy is not constant for
> every node, so assuming that a fixed number of slots will be filled in
> to generate the taxonomy leads to problems.
> 
> Use the flatfile implementation (Bio::DB::Taxonomy::flatfile) as the
> best example of working code, as this is how I really wanted it to
> work; the Bio::Species hacks are only there to shoehorn in data
> retrieved from genbank files.  With the flatfile implementation
> you have to walk all the way up the db hierarchy to get the kingdom
> for a node, so you do have to build up the classification hierarchy,
> as each node only stores data about itself.
> 
> I'm not exactly sure what you are proposing to do, but would
> definitely enjoy another pair of hands, I don't really have time to
> mess with it any time soon.
> 
> -jason
> On May 10, 2006, at 5:30 AM, Sendu Bala wrote:
> 
> > Hi,
> > I'm a little confused as to how names are supposed to work in
> > Bio::Taxonomy::Node.
> >
> > In the bioperl versions that I've looked at, a Node doesn't seem to
> > store the most important information about itself - its scientific
> > name - in an obvious place. bioperl 1.5.1 puts it at the start of the
> > classification list. I'd have thought sticking it in -name would make
> > more sense, but this is used only for the GenBank common name.
> >
> > The Bio::Taxonomy docs still suggest:
> >
> > my $node_species_sapiens = Bio::Taxonomy::Node->new(
> >    -object_id => 9606, # or -ncbi_taxid. Required tag
> >    -names => {
> >        'scientific' => ['sapiens'],
> >        'common_name' => ['human']
> >    },
> >    -rank => 'species'  # Required tag
> > );
> >
> > and whilst Bio::Taxonomy::Node does not accept -names, it does have a
> > 'name' method which claims to work like:
> >
> > $obj->name('scientific', 'sapiens');
> >
> > This kind of thing would be really nice, but afaics
> > Bio::Taxonomy::Node->new takes the -name value and makes a common name
> > out of it, whilst the name() method passes any 'scientific' name to
> > the scientific_name() method, which is unable to set any value (and
> > warns about this), only get.
> >
> > It seems like the need to have this classification array work the same
> > way as Bio::Species is causing some unnecessary restrictions. Can't
> > the more sensible idea of having a dedicated storage spot for the
> > ScientificName and other parameters be used, with the classification
> > array either being generated just-in-time from the hash-stored data,
> > or indeed being generated from the Lineage field?
> >
> >
> > Also, why does a node store the complete hierarchy on itself in the
> > classification array? If we're going that far, why don't the
> > Bio::DB::Taxonomy modules like Bio::DB::Taxonomy::entrez just have a
> > get_taxonomy() method instead of a get_Taxonomy_Node() method?
> > get_taxonomy() could, from a single efetch.fcgi lookup, create a
> > complete Bio::Taxonomy with all the nodes. Whilst most nodes would
> > only have a minimum of information, if you could simply ask a node
> > what its rank and scientific name were you could easily build a
> > classification array, or ask what Kingdom your species was in etc.
> >
> > Are there good reasons for Taxonomy working the way it does in
> > 1.5.1, or would I not be wasting my time re-writing things to make
> > more sense (to me)?
> >
> >
> > Cheers,
> > Sendu.
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12