[Bioperl-l] Bio::DB::Taxonomy:: mishandles species, subspecies/variant names

Mon May 15 16:08:30 UTC 2006

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Monday, May 15, 2006 3:18 AM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::DB::Taxonomy:: mishandles species,
> subspecies/variant names
> 
> Chris Fields wrote:
> > Sendu Bala wrote:
> >> In bioperl up to at least 1.5.1, when one of the database modules
> >> comes across a species rank it does:
> >>
> >> if ($rank eq 'species') { # get rid of genus from species name
> >> (undef,$taxon_name) = split(/\s+/,$taxon_name,2); }
> >
> > The XML example from NCBI Taxonomy I mentioned previously seems to
> > have everything in the classification, from superkingdom down to
> > species (no strain unfortunately, and I'm not sure about subspecies);
> > if it's missing the rank then the designation doesn't exist or is
> > tagged as 'no rank'.  Like I mentioned before, I'm not intimately
> > familiar with Bio::Taxonomy, Bio::DB::Taxonomy, or Bio::Species, so I
> > don't have a clue as to how everything is parsed and plugged in to
> > Bio::Taxonomy objects.  I do know that XML::Twig is used for parsing
> > through the data so it shouldn't be too hard to change what you
> > want.
> 
> Yes, that's all true, but I'm not sure what it has to do with what I was
> saying. FYI, you do get a 'subspecies' rank but no 'variant' rank. In my
> own implementation I change the rank of all 'no rank' Nodes below
> species to 'variant'.
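
For anyone following along, here is a minimal sketch of that kind of
re-ranking.  It is not Sendu's actual code: the sub name and the
[name, rank] lineage layout are made up for illustration, and the ranks in
the sample data are assumed rather than fetched from NCBI.

```perl
use strict;
use warnings;

# Re-label 'no rank' nodes that sit below the species node as 'variant'.
# Input: lineage ordered root-to-leaf as [name, rank] pairs
# (a hypothetical layout, not a BioPerl data structure).
sub rerank_below_species {
    my @lineage      = @_;
    my $seen_species = 0;
    my @out;
    for my $node (@lineage) {
        my ($name, $rank) = @$node;
        # only nodes after the species node are candidates
        $rank = 'variant' if $seen_species && $rank eq 'no rank';
        $seen_species = 1 if $rank eq 'species';
        push @out, [$name, $rank];
    }
    return @out;
}

# Illustrative lineage; the ranks are assumptions for the example.
my @lineage = (
    ['cellular organisms',                         'no rank'],
    ['Bacillus',                                   'genus'],
    ['Bacillus subtilis',                          'species'],
    ['Bacillus subtilis subsp. subtilis',          'subspecies'],
    ['Bacillus subtilis subsp. subtilis str. 168', 'no rank'],
);

printf "%-45s %s\n", @$_ for rerank_below_species(@lineage);
```

Only the 'no rank' node below the species changes; the 'no rank' node above
it and the 'subspecies' node are left alone.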

Sorry; wandered a bit off topic there.

> > I haven't tried using Bio::DB::Taxonomy directly yet, but I would
> > have thought that the binomial is just built from the XML twig
> > 'LineageEx' Rank=Genus + Rank=Species, that the genus comes from the
> > tag 'Genus' and species from 'Species', and that the scientific name
> > is from the tag 'ScientificName'.  Guess not.
> 
> No. See above for what it actually does. That is a copy/paste from the
> code (there, $taxon_name == ScientificName). When it finds a species
> rank it does that split because in the
> ncbi taxonomy database the 'genus' rank for a human has a ScientificName
> of 'Homo', whilst the 'species' rank has a ScientificName of 'Homo
> sapiens', and the bioperl model (quite rightly, I think) wants the
> 'species' node to not have information of other nodes (well, except for
> the classification array). So it removes the 'Homo' from 'Homo sapiens'
> giving a species name of 'sapiens'. This then allows the binomial method
> to return 'Homo sapiens' instead of 'Homo Homo sapiens'.
> 
> (though in a bizarre twist, and this is one of my problems with how
> names are currently represented in the Taxonomy modules, 'Scientific
> Name' and 'binomial' are synonymous)
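
To make the behaviour concrete, here is a small standalone sketch that
applies the same split expression outside BioPerl.  The helper name is made
up, and the virus genus/species pairing is an assumed example of a
species-rank name whose first word is not the genus.

```perl
use strict;
use warnings;

# Same expression as the quoted BioPerl snippet: drop the first
# whitespace-delimited word of a species-rank ScientificName.
# (strip_first_word is a hypothetical helper, not a BioPerl method.)
sub strip_first_word {
    my ($taxon_name) = @_;
    (undef, $taxon_name) = split(/\s+/, $taxon_name, 2);
    return $taxon_name;
}

# [genus, species-rank ScientificName]; the virus pairing is an assumption
my @cases = (
    ['Homo',       'Homo sapiens'],
    ['Lentivirus', 'Human immunodeficiency virus 1'],
);

for my $case (@cases) {
    my ($genus, $name) = @$case;
    my $species = strip_first_word($name);
    print "$name => species '$species', binomial '$genus $species'\n";
}
```

The first case gives the expected 'Homo sapiens'.  The second gives
'Lentivirus immunodeficiency virus 1', because the word removed was never
the genus in the first place.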

Ah, now I see.  That's a bit screwy, but it's not on our end so we have to
deal with it.  I noticed that the subspecies node contains the entire
string as well:

    <Taxon>
      <TaxId>135461</TaxId>
      <ScientificName>Bacillus subtilis subsp. subtilis</ScientificName>
      <Rank>subspecies</Rank>
    </Taxon>

As for the 'scientific_name' method when accessed through Bio::DB::Taxonomy,
it almost never returns the actual scientific name for the node (the one on
the GenBank ORGANISM line); I get the name with the strain chopped off
instead, and a number of times the name gets mangled.  The regexes below
only grab the topmost tags for each taxon (those indented by two spaces):

Script:
---------------------------------
#! perl
use strict;
use warnings;

use Bio::DB::Taxonomy;

# NCBI Taxonomy XML file; falls back to tax.xml if no argument is given
my $file = shift(@ARGV) || 'tax.xml';

print "\nNCBI XML output ScientificName tag for each node:\n";
my @taxid = ();
open (my $taxfile, '<', $file) or die "Can't open file $file: $!\n";
while (<$taxfile>){
	# only the topmost tags of each taxon are indented by two spaces
	if (/^\s{2}<TaxId>(\d+)<\/TaxId>/) {
		print "$1\t";
		push @taxid, $1;
	}
	print "$1\n" if /^\s{2}<ScientificName>(.*)<\/ScientificName>/;
}
close $taxfile;

print "\nBio::DB::Taxonomy scientific_name:\n";
# one factory is enough; no need to rebuild it for every ID
my $factory = Bio::DB::Taxonomy->new(-source => 'entrez');
for my $id (@taxid){
	my $node = $factory->get_Taxonomy_Node(-taxonid => $id);
	print $node->ncbi_taxid,"\t",$node->scientific_name,"\n";
}
---------------------------------

Output:
---------------------------------
NCBI XML output ScientificName tag for each node:
191218  Bacillus anthracis str. A2012
198094  Bacillus anthracis str. Ames
222523  Bacillus cereus ATCC 10987
224308  Bacillus subtilis subsp. subtilis str. 168
226186  Bacteroides thetaiotaomicron VPI-5482
226900  Bacillus cereus ATCC 14579
246194  Carboxydothermus hydrogenoformans Z-2901
260799  Bacillus anthracis str. Sterne
261594  Bacillus anthracis str. 'Ames Ancestor'
264462  Bdellovibrio bacteriovorus HD100
272558  Bacillus halodurans C-125
272559  Bacteroides fragilis NCTC 9343
279010  Bacillus licheniformis ATCC 14580
281309  Bacillus thuringiensis serovar konkukian str. 97-27
288681  Bacillus cereus E33L
295405  Bacteroides fragilis YCH46
66692   Bacillus clausii KSM-K16
76114   Azoarcus sp. EbN1

Bio::DB::Taxonomy scientific_name:
191218  Bacillus cereus group anthracis
198094  Bacillus cereus group anthracis
222523  Bacillus cereus group cereus
224308  subtilis Bacillus subtilis subsp. subtilis
226186  Bacteroides thetaiotaomicron
226900  Bacillus cereus group cereus
246194  Carboxydothermus hydrogenoformans
260799  Bacillus cereus group anthracis
261594  Bacillus cereus group anthracis
264462  Bdellovibrio bacteriovorus
272558  Bacillus halodurans
272559  Bacteroides fragilis
279010  Bacillus licheniformis
281309  Bacillus cereus group thuringiensis
288681  Bacillus cereus group cereus
295405  Bacteroides fragilis
66692   Bacillus clausii
76114   Azoarcus sp.
---------------------------------
Note the Bacillus subtilis entry (224308) in the Bio::DB::Taxonomy output
above.  Not one of those names is the scientific name as defined by NCBI
(and by most taxonomists, for that matter).

So, in a nutshell, there's a problem here.  I don't know if your fix works
for that, but I definitely don't think the 'scientific name' should be
assembled ad hoc; it should be taken from the ScientificName tag for that
node.  I am currently reduced to grabbing the feature with primary tag
'source' and reading its 'organism' tag.  I cannot stress enough that it
should NOT be that way.

As for 'binomial' == 'scientific_name', I agree; I see it as well and that
should be fixed.

...
> Perhaps, but again I'm not sure what this has to do with what I was
> saying. If you don't want your species name to contain your genus name
> you have to do some kind of parsing. My post merely pointed out that the
> parsing currently in bioperl does not work for viruses and possibly
> other species. I'd like to think that someone cares about this error and
> would do the simple fix I offered, or that they already know about the
> problem and have done their own fix.

Again, that was me going off-topic, so my apologies; it has more to do with
my frustrations with Bio::Species (not Bio::DB::Taxonomy).  My point here was,
since there is no real way to surmise from a GenBank flatfile what the
taxonomic ranks are w/o guessing (which seems to break more often than not
when dealing with complex names), there shouldn't be any tie to Bio::Tax
objects, at least directly.  I guess methods could be incorporated into
Bio::Species for those who want to give it a try, but I would like to get a
GenBank file, for once, in which the scientific name/binomial name isn't
mangled by Bio::Species.

Back to Bio::DB::Taxonomy; I don't have a problem with implementing your
methods here; on the contrary, if they fix my problem above then I'll be
more than glad to.  I can't get to it immediately but maybe later
today/tomorrow.

> > I'm also not sure that forcing a lookup for every TaxID in every
> > sequence every time it's passed through SeqIO is the best way to go
> > either, though I think it should be required for storing sequences.
> > It's a tricky balance.
> 
> In my own implementation any database lookups are cached, and you have
> the option of not doing any database lookup at all and 'faking' a
> taxonomy from the supplied list of names (so it works just like normal
> Bio::Seq).
>
> 
> > I still think that maybe we should absolve ourselves from using
> > SOURCE/ORGANISM or OS/OC information in GenBank files as anything
> > more than strictly annotation, or reconstruct Bio::Species to maybe a
> >  Bio::Annotation::Species object to handle that annotation and either
> >  deprecate Bio::Species or separate it completely from any
> > Bio::Taxonomy objects.  It would really simplify things.  Then, if
> > anyone is interested in taxonomy, either install a local database or
> >  use Entrez efetch, and then use Bio::DB::Taxonomy (fixed of course)
> >  to grab the TaxID info.
> 
> My personal view is that having it as an annotation would serve no real
> purpose. For me the whole point of any kind of species representation in
> bioperl is to allow you to compare species in a biologically meaningful
> way. If it's just some annotation then that means it's basically
> free-form text and you have no guarantee that two sequences from the
> same species are annotated exactly the same - no guarantee that your
> code would identify that those sequences are from the same species.
> The only other useful thing that a species object needs to do is let you
> know how related two different species are - you need to be able to ask
> what a species' class, kingdom etc. are. Again, not viable with an
> annotation - you need something strict like a properly constructed
> Taxonomy.

My point is, a large number of users do NOT use, nor care about, taxonomic
information to the degree they need to know the entire classification of the
organism; many are just as happy about getting the scientific name only,
which is in the GenBank/EMBL file itself.  To take one extreme, it is not
productive to force every user to download the NCBI tax database and use
lookups just to convert sequences from EMBL format to GenBank format.  It's
not productive to allow users to spam the NCBI tax database remotely either,
so hardcoding lookups is, IMHO, a big mistake.  

> I guess it comes down to the philosophy of parsing a file. Do you try
> and reflect exactly what the file contains, letter for letter, so that
> your resulting object can recreate that file letter for letter, or do
> you parse the file and extract the correct /meaning/ in order to be more
> useful?
> I think there can be a choice by the user, and this is best done by
> making Bio::Species a clever wrapper around an improved Bio::Taxonomy,
> as in my own implementation.

I understand both philosophies, but the latter implies that you know the
intention of the ones submitting the sequence.  99.9% of the time that's
fine, something I can live with.  However, when we mess up something as
simple as getting the scientific name for an organism when the information
is directly in the flat file (ORGANISM line) by trying to infer what the
classification is, yes, I get frustrated.  Even more frustrating to me is
that Bio::DB::Taxonomy, which should return accurate information directly
from the Taxonomy database, still manages to screw up the scientific name.  

The NCBI description of the sample record:

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

states that the ORGANISM line contains the formal scientific name and its
lineage (no ranking).  If the lineage is very long it is abbreviated, so you
don't get the same thing as you would by using the TaxID.
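
As a rough illustration of how little is needed to get the name straight
from the file, here is a sketch that pulls the ORGANISM name and the
unranked lineage out of flat-file text with plain regexes.  It is not how
Bio::SeqIO parses records: parse_organism is a made-up helper, the record
fragment is an illustrative sample, and the code assumes the organism name
fits on the ORGANISM line itself.

```perl
use strict;
use warnings;

# Return (scientific name, arrayref of unranked lineage names) from
# GenBank flat-file text.  Hypothetical helper, not part of BioPerl.
sub parse_organism {
    my ($text) = @_;
    my ($name, @lineage_lines);
    my $in_organism = 0;
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s+ORGANISM\s+(.+?)\s*$/) {
            $name        = $1;
            $in_organism = 1;
        }
        elsif ($in_organism && $line =~ /^\s+(\S.*?)\s*$/) {
            push @lineage_lines, $1;    # indented continuation = lineage
        }
        elsif ($in_organism) {
            last;    # a keyword at column 0 ends the ORGANISM block
        }
    }
    my $lineage = join ' ', @lineage_lines;
    $lineage =~ s/\.\s*$//;             # drop the terminating period
    return ($name, [ split /;\s*/, $lineage ]);
}

# Illustrative record fragment (assumed sample data)
my $record = <<'END';
SOURCE      Bacillus subtilis subsp. subtilis str. 168
  ORGANISM  Bacillus subtilis subsp. subtilis str. 168
            Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
REFERENCE   1  (bases 1 to 100)
END

my ($name, $lineage) = parse_organism($record);
print "name:    $name\n";
print "lineage: ", join(' > ', @$lineage), "\n";
```

The name comes back untouched, strain and all, and the lineage is just a
list of names with no ranks attached, which is all the flat file offers.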

So, in essence, I believe you are correct, that Bio::Species can be used as
a 'wrapper' for Bio::Taxonomy objects, but only up to a certain degree with
caveats or warnings for possible inaccuracies.  I also believe that lookups
should be allowed but optional, not required (i.e. left up to the user, as
you state).  
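
Roughly what I have in mind, as a sketch only: the lookup is a code ref the
user may or may not supply, results are cached per TaxID, and without a
lookup the name from the file is returned as-is.  make_resolver and its
'lookup' argument are made up for illustration (nothing like this exists in
BioPerl), and the lookup here is a stand-in hash rather than a real
database call.

```perl
use strict;
use warnings;

# Build a name resolver.  'lookup' is an optional code ref (hypothetical)
# that maps a TaxID to a scientific name, e.g. via a taxonomy database.
sub make_resolver {
    my (%args) = @_;
    my $lookup = $args{lookup};    # undef means: never touch a database
    my %cache;
    return sub {
        my ($taxid, $file_name) = @_;
        return $file_name unless $lookup && defined $taxid;
        # query at most once per TaxID
        $cache{$taxid} = $lookup->($taxid) unless exists $cache{$taxid};
        # fall back to the flat-file name if the lookup found nothing
        return defined $cache{$taxid} ? $cache{$taxid} : $file_name;
    };
}

# Stand-in for a real database, counting how often it is queried
my $calls   = 0;
my %fake_db = (224308 => 'Bacillus subtilis subsp. subtilis str. 168');

my $with_db = make_resolver(lookup => sub { $calls++; $fake_db{ $_[0] } });
my $no_db   = make_resolver();

print $with_db->(224308, 'Bacillus subtilis'), "\n" for 1 .. 3;
print "lookups performed: $calls\n";
print $no_db->(224308, 'Bacillus subtilis'), "\n";
```

With the stand-in above, the three calls through $with_db hit the lookup
once; $no_db never looks anything up and simply hands back the name from
the file.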

I just feel that it's somewhat misleading to imply, by delegating to
Bio::Taxonomy, that Bio::Species contains accurate taxonomic information
when NCBI themselves state that the GenBank flatfile classification can be
incomplete and does not supply rankings (genus, species) in the file.  It's
our best guess in most cases, and a best guess by definition is not very
accurate.  If you want taxonomic accuracy, use the TaxID and a local tax
database.  I feel that we shouldn't punish those who don't worry/care about
taxonomy by implementing Bio::Species with methods that mangle data that's
directly in the flat file they're parsing.

Okay, not to cut short this discussion, but I have to get back to $job.
I'll try adding your fixes in a bit later today/tomorrow; if they pass tests
I'll commit them in.

Chris