[Bioperl-l] Bio::*Taxonomy* changes

Chris Fields cjfields at uiuc.edu
Tue Jul 18 15:44:07 UTC 2006


> What about the existing genus(), species(), sub_species() and variant()
> methods? There would be no need for any logic to join things together,
> but I would still like to be able to get just 'sapiens' from somewhere.
> Can I use species() for that purpose (though again, species is strictly
> 'Homo sapiens')? Likewise sub_species() and variant() could hold the
> remaining non-redundant names. Or should all of these be deprecated
> because they don't really have a place in a generic Node class?

This is where Hilmar suggests that you have a bit of freedom to do what
you want, as with binomial().  So species() should return just the species
epithet ('sapiens'), genus() should return the genus, and so on.

At that level some additional data munging will be needed, since the names
for ranks below species seem to include the entire name, not just the
epithet for that rank.  But this could be done from the lineage, provided
all nodes are present and tagged with their rank.
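As a rough sketch of that munging (not how the current modules do it): the
lineage below is a hand-made list of hashes, and name_for_rank() and
strip_prefix() are hypothetical helpers, not BioPerl methods.  The idea is
simply to strip the parent rank's name off the front of the child's name.

```perl
use strict;
use warnings;

# Hypothetical lineage, root to leaf; each node is tagged with its rank.
# As in NCBI, ranks at and below species carry the full name.
my @lineage = (
    { rank => 'genus',      name => 'Homo' },
    { rank => 'species',    name => 'Homo sapiens' },
    { rank => 'subspecies', name => 'Homo sapiens neanderthalensis' },
);

# Return the name of the first node with the given rank, or undef if absent.
sub name_for_rank {
    my ($rank, @nodes) = @_;
    for my $node (@nodes) {
        return $node->{name} if $node->{rank} eq $rank;
    }
    return;
}

# Strip a parent name from the front of a child name, if it is there.
sub strip_prefix {
    my ($name, $prefix) = @_;
    return $name unless defined($name) && defined($prefix);
    $name =~ s/^\Q$prefix\E\s+//;
    return $name;
}

my $genus      = name_for_rank('genus',      @lineage);
my $species    = name_for_rank('species',    @lineage);
my $subspecies = name_for_rank('subspecies', @lineage);

my $species_epithet    = strip_prefix($species,    $genus);
my $subspecies_epithet = strip_prefix($subspecies, $species);

print "$genus | $species_epithet | $subspecies_epithet\n";
```

If the genus node is missing from the lineage, strip_prefix() leaves the
name untouched, which is probably the safest fallback.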

> What about node_name()? Yet another synonym of scientific_name? (right
> now it grabs the common name(s)). Ugh.

I agree things need cleaning up.  You could always make node_name() an alias
for scientific_name(), though it could simply be deprecated.
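Something along these lines would do both at once.  This is only a sketch:
My::TaxonNode is a made-up stand-in class, not the real node class, and the
warning text is just an example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy node class; not the real BioPerl API.
package My::TaxonNode;

sub new {
    my ($class, %args) = @_;
    return bless { scientific_name => $args{scientific_name} }, $class;
}

sub scientific_name {
    my ($self) = @_;
    return $self->{scientific_name};
}

# Deprecated alias: warn the first time it is called, then delegate.
my $warned = 0;
sub node_name {
    my ($self) = @_;
    warn "node_name() is deprecated; use scientific_name()\n" unless $warned++;
    return $self->scientific_name;
}

package main;

my $node = My::TaxonNode->new(scientific_name => 'Homo sapiens');
print $node->node_name, "\n";
```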

> What should I do with the classification array? Should it hold the raw
> ScientificName like:
> join(',', $node->classification) eq 'Homo sapiens, Homo,
> Homo/Pan/Gorilla group [...]'?
> Or should it be like:
> join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla
> group [...]'?

I don't know what the dump file gives.  The XML output from efetch (via
Entrez) has both the raw lineage (as it appears in a GenBank sequence file)
and the full lineage, with the TaxID, rank and scientific name of each node
in actual lineage order.  I think one problem area will be the 'no rank'
designations in the lineage.  Note that the example below also has a species
rank but no genus; tricky!

<Taxon>
  <TaxId>312284</TaxId>
  <ScientificName>marine actinobacterium PHSC20C1</ScientificName>
  <OtherNames>
    <EquivalentName>marine actinobacterium strain PHSC20C1</EquivalentName>
    <EquivalentName>marine actinobacterium str. PHSC20C1</EquivalentName>
  </OtherNames>
  <ParentTaxId>78537</ParentTaxId>
  <Rank>species</Rank>
  <Division>Bacteria</Division>
...
  <Lineage>cellular organisms; Bacteria; Actinobacteria; Actinobacteria (class); unclassified Actinobacteria; unclassified Actinobacteria (miscellaneous)</Lineage>
  <LineageEx>
    <Taxon>
      <TaxId>131567</TaxId>
      <ScientificName>cellular organisms</ScientificName>
      <Rank>no rank</Rank>
    </Taxon>
    <Taxon>
      <TaxId>2</TaxId>
      <ScientificName>Bacteria</ScientificName>
      <Rank>superkingdom</Rank>
    </Taxon>
    <Taxon>
      <TaxId>201174</TaxId>
      <ScientificName>Actinobacteria</ScientificName>
      <Rank>phylum</Rank>
    </Taxon>
    <Taxon>
      <TaxId>1760</TaxId>
      <ScientificName>Actinobacteria (class)</ScientificName>
      <Rank>class</Rank>
    </Taxon>
    <Taxon>
      <TaxId>52018</TaxId>
      <ScientificName>unclassified Actinobacteria</ScientificName>
      <Rank>no rank</Rank>
    </Taxon>
    <Taxon>
      <TaxId>78537</TaxId>
      <ScientificName>unclassified Actinobacteria (miscellaneous)</ScientificName>
      <Rank>no rank</Rank>
    </Taxon>
  </LineageEx>
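
To make the two problems concrete, here is a small sketch.  The LineageEx
entries above are transcribed by hand into a list (no XML parsing), and the
variable names are made up for illustration.  'no rank' turns up several
times, so it can't be used as a lookup key the way real ranks can, and a
lookup for genus comes back empty.

```perl
use strict;
use warnings;

# LineageEx entries from the example above, transcribed by hand as
# [TaxId, scientific name, rank], root first.
my @lineage_ex = (
    [ 131567, 'cellular organisms',                          'no rank'      ],
    [ 2,      'Bacteria',                                    'superkingdom' ],
    [ 201174, 'Actinobacteria',                              'phylum'       ],
    [ 1760,   'Actinobacteria (class)',                      'class'        ],
    [ 52018,  'unclassified Actinobacteria',                 'no rank'      ],
    [ 78537,  'unclassified Actinobacteria (miscellaneous)', 'no rank'      ],
);

# Index ranked nodes by rank; collect 'no rank' nodes separately, because
# that label is not unique within a lineage.
my (%by_rank, @unranked);
for my $entry (@lineage_ex) {
    my ($taxid, $name, $rank) = @$entry;
    if ($rank eq 'no rank') {
        push @unranked, $name;
    }
    else {
        $by_rank{$rank} = $name;
    }
}

# There is no genus node in this lineage, so this is undef.
my $genus = $by_rank{genus};

print "ranked: ", join(', ', map { "$_=$by_rank{$_}" } sort keys %by_rank), "\n";
print "unranked nodes: ", scalar(@unranked), "\n";
print "genus: ", (defined($genus) ? $genus : '(none in lineage)'), "\n";
```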


> The latter is how it currently works (when it works correctly); I would
> rather fix it than lose the logic completely, but if we're staying true
> to proper classification (vs. what a programmer might expect), I guess I
> must use the raw ScientificName?
>
> > binomial() isn't part of the NCBI taxonomy definition, so you have
> > freedom there to report what suits you.
> 
> I don't think binomial() would serve any useful purpose now, however. I
> can either deprecate it or make it a synonym of scientific_name() or
> both. Or binomial() can be a version of scientific_name() that complains
> if you use it on a rank higher or lower than species. As for species()
> et al., it may have no place in a generic Node class. Thoughts?

Using scientific_name() in this context is more about conforming to what
NCBI defines it as than to the actual definition; this should be stated
explicitly in the POD, and is more for long-term maintainability.  No matter
what is done here there will be some degree of confusion between those who
want strict adherence to the term 'scientific name' and those who want the
method to conform to NCBI's definition.  Better to document the reasoning
for it in some way than risk the random masses complaining.

We could use binomial() for the 'scientific name' as the rest of the world
knows it (as in binomial nomenclature), building it from genus and species
as you originally suggested.  That's what Hilmar suggested as an
'experimental' area of sorts, since NCBI doesn't use that particular term in
its taxonomy definition.
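
A minimal sketch of what I mean, combined with your idea of complaining on
the wrong rank.  The function and its named arguments are hypothetical, not
an existing method; it assumes the genus and species epithet have already
been worked out from the lineage.

```perl
use strict;
use warnings;

# Hypothetical helper, not an existing BioPerl method: build a binomial from
# a genus and a species epithet, and refuse when the rank is not species or
# when either part is missing.
sub binomial {
    my (%arg) = @_;
    my ($rank, $genus, $species) = @arg{qw(rank genus species)};
    die "binomial() only applies to species-rank nodes\n"
        unless defined($rank) && $rank eq 'species';
    die "binomial() needs both a genus and a species epithet\n"
        unless defined($genus) && defined($species);
    return "$genus $species";
}

my $human = binomial(rank => 'species', genus => 'Homo', species => 'sapiens');
print "$human\n";

# The marine actinobacterium from the XML example has rank 'species' but no
# genus, so the call dies; eval traps the error and we keep the message.
my $odd = eval {
    binomial(rank => 'species', species => 'marine actinobacterium PHSC20C1');
};
my $err = $@;
print "no binomial: $err" unless defined $odd;
```

Whether it should die, warn, or quietly fall back to scientific_name() in
the no-genus case is open; dying is just the simplest thing to show.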

Chris




