[Bioperl-l] Bio::*Taxonomy* changes

Chris Fields cjfields at uiuc.edu
Tue Jul 18 01:36:12 UTC 2006


On Jul 17, 2006, at 4:33 PM, Sendu Bala wrote:

> Chris Fields wrote:
>> There was some interest in getting Bio::Species to delegate to
>> Bio::Taxonomy::Node, so having scientific_name() would help quite
>> a bit since the name used on the ORGANISM line is the scientific
>> name (well, is supposed to be; famous last words).
>
> Can you clarify exactly what you mean here? Preferably with an
> example? ORGANISM line of which file format?
> The reason I ask is that I still feel we need to do parsing of the
> names for species rank and lower:

Sorry, should have clarified; GenBank sequence format.  Here's the link:

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

The ORGANISM annotation line for a GenBank record contains the formal  
scientific name for the organism along with the lineage.  I believe  
SwissProt/EMBL and several other RichSeq formats do the same.  The  
lineage that is also present is almost always abbreviated, so it's  
not always possible to determine the formal rankings strictly from  
the file with any real degree of reliability (hence the past problems  
with Bio::Species).
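
To make the point about the lineage concrete, here is a minimal Python sketch (Python rather than Perl, and not BioPerl code). The ORGANISM block below is an illustrative fragment written in the style of a GenBank record, and parse_organism() is a hypothetical helper. It shows that the file gives a name plus a list of bare lineage names, with no rank attached to any of them.

```python
# Illustrative ORGANISM block in GenBank flat-file style (assumed layout:
# the name on the ORGANISM line, the lineage on the continuation lines).
ORGANISM_BLOCK = """  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae;
            Saccharomyces."""


def parse_organism(block):
    """Return (name, lineage) from an ORGANISM block (hypothetical helper)."""
    lines = block.splitlines()
    first = lines[0].strip()
    if not first.startswith("ORGANISM"):
        raise ValueError("block does not start with an ORGANISM line")
    # Everything after the keyword is taken verbatim as the name.
    name = first[len("ORGANISM"):].strip()
    # Continuation lines hold a ';'-separated lineage ending in '.'.
    lineage_text = " ".join(line.strip() for line in lines[1:])
    lineage = [t.strip() for t in lineage_text.rstrip(".").split(";") if t.strip()]
    return name, lineage


name, lineage = parse_organism(ORGANISM_BLOCK)
print(name)
print(lineage)  # names only; nothing here says which entry is a family or genus
```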

>
> # The 'scientific name' for humans could be considered to be 'Homo
> sapiens'.
> # Taxid 9606 in the NCBI taxonomy database has rank 'species' and
> ScientificName 'Homo sapiens'.
> # For sanity, Bio::*Taxonomy* likes to interpret this ScientificName
> as 'sapiens' so that the genus is not held redundantly. It provides a
> binomial() method to give you 'Homo sapiens' again if you want it.
> # I plan on maintaining this; scientific_name() would give you the
> non-redundant sibling-unique name 'sapiens'. binomial() on a species
> rank and lower would give you 'Homo sapiens' (presumably grabbing the
> 'Homo' from the parent node with rank 'genus', or similar).

I think you should use scientific_name to designate the full formal
scientific name for an organism according to the way NCBI describes
it for that particular node (nothing more, except removing the <>
stuff you mentioned earlier) and as it would appear on the ORGANISM
line.  Otherwise you'll run into serious species/subspecies/strain
headaches (see below).  If you want the real genus/species (i.e.
nothing extra, like strains or subspecies), separate them out and
store them using a genus/species get/set if possible; binomial() will
then give back the two-name genus species designation.

Here are a couple of examples (in XML, retrieved with EUtilities).
They were fetched by NCBI TaxID, using ELink from a list of protein
GIs (~700 in total), so they represent the actual NCBI TaxID linked
to each sequence record.  If you try breaking these apart into
species, what happens to the strain/subspecies stuff?  Notice that
many of these nodes, which come directly from protein GIs, also have
no rank.

...
   <TaxId>376686</TaxId>
   <ScientificName>Flavobacterium johnsoniae UW101</ScientificName>
   <OtherNames>
     <Synonym>Flavobacterium johnsoniae NBRC 14942</Synonym>
     <Synonym>Flavobacterium johnsoniae IFO 14942</Synonym>
     <Synonym>Flavobacterium johnsoniae IAM 14304</Synonym>
     <Synonym>Flavobacterium johnsoniae MYX.1.1.1</Synonym>
     <Synonym>Flavobacterium johnsoniae NCIB 11054</Synonym>
     <Synonym>Flavobacterium johnsoniae DSM 2064</Synonym>
     <Synonym>Flavobacterium johnsoniae LMG 1341</Synonym>
     <Synonym>Flavobacterium johnsoniae ATCC 17061</Synonym>
     <EquivalentName>Flavobacterium johnsoniae strain UW101</EquivalentName>
     <EquivalentName>Flavobacterium johnsoniae str. UW101</EquivalentName>
   </OtherNames>
   <ParentTaxId>986</ParentTaxId>
   <Rank>no rank</Rank>
   <Division>Bacteria</Division>
...


   <TaxId>370552</TaxId>
   <ScientificName>Streptococcus pyogenes MGAS10270</ScientificName>
   <OtherNames>
     <EquivalentName>Streptococcus pyogenes strain MGAS10270</EquivalentName>
     <EquivalentName>Streptococcus pyogenes str. MGAS10270</EquivalentName>
   </OtherNames>
   <ParentTaxId>301448</ParentTaxId>
   <Rank>no rank</Rank>
   <Division>Bacteria</Division>
...

   <TaxId>224308</TaxId>
   <ScientificName>Bacillus subtilis subsp. subtilis str. 168</ScientificName>
   <OtherNames>
     <Synonym>Bacillus subtilis subsp. subtilis 168</Synonym>
   </OtherNames>
   <ParentTaxId>135461</ParentTaxId>
   <Rank>no rank</Rank>
   <Division>Bacteria</Division>
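
As a rough illustration of the question above, the following Python sketch (not BioPerl code) reads trimmed copies of the three records and splits each ScientificName on whitespace. The enclosing TaxaSet/Taxon elements are assumed here so the fragment is well-formed, and naive_split() is a hypothetical helper, not an existing API. It shows that a plain genus/species split leaves the strain and subspecies text as an unassigned remainder, and that the rank field offers no help.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the three records above. The <TaxaSet>/<Taxon>
# wrappers are an assumption made so the fragment parses as XML.
XML = """<TaxaSet>
  <Taxon>
    <TaxId>376686</TaxId>
    <ScientificName>Flavobacterium johnsoniae UW101</ScientificName>
    <Rank>no rank</Rank>
  </Taxon>
  <Taxon>
    <TaxId>370552</TaxId>
    <ScientificName>Streptococcus pyogenes MGAS10270</ScientificName>
    <Rank>no rank</Rank>
  </Taxon>
  <Taxon>
    <TaxId>224308</TaxId>
    <ScientificName>Bacillus subtilis subsp. subtilis str. 168</ScientificName>
    <Rank>no rank</Rank>
  </Taxon>
</TaxaSet>"""


def naive_split(name):
    """Split a name into (genus, species, leftover) on whitespace."""
    parts = name.split()
    genus = parts[0]
    species = parts[1] if len(parts) > 1 else ""
    leftover = " ".join(parts[2:])  # strain/subspecies text with no home
    return genus, species, leftover


def parse_taxa(xml_text):
    """Collect TaxId, ScientificName and Rank for each Taxon element."""
    rows = []
    for taxon in ET.fromstring(xml_text).iter("Taxon"):
        rows.append({
            "taxid": taxon.findtext("TaxId"),
            "name": taxon.findtext("ScientificName"),
            "rank": taxon.findtext("Rank"),
        })
    return rows


rows = parse_taxa(XML)
for row in rows:
    genus, species, leftover = naive_split(row["name"])
    print(row["taxid"], row["rank"], "|", genus, "|", species, "|", leftover)
```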

> Good, bad or ugly? I would prefer it works like this and we agree to
> differ with NCBI on what the 'scientific name' of a species node
> should be. Bio::Species can still delegate to Bio::Taxonomy::Node by
> calling binomial() (which I propose will actually give the correct
> answer, even for bacteria and viruses).

This is where I would strongly disagree (though I agree that the way  
NCBI uses 'scientific name' is a bit off).

We are using the NCBI tax database, and as such we are somewhat at
the mercy of the NCBI tax nomenclature, unfortunately.

If NCBI decides to change their official definition for the  
scientific name to something that made a bit more sense, the XML and  
dump data will reflect that and we won't have many problems adapting  
since the scientific name will always conform to their definition.   
But if we split the information up ad hoc then we are bound for  
disaster; it's just way too much headache to worry about.  We could  
always point to the official NCBI definition as the one we adopt and  
then assign the tagged information from the node directly to  
scientific_name (no globbing together at all).  Bio::Species could
delegate likewise from the ORGANISM line, so there are no piecemeal
attempts to get Humpty Dumpty to fit back together again.
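
The storage scheme described here can be sketched as follows, in Python rather than Perl. TaxonNode and its fields are hypothetical and are not the Bio::Taxonomy::Node API. The scientific name is kept exactly as the tag supplied it, genus and species are separate optional fields (assumed to be filled in from some other source, such as a lineage lookup), and binomial() is built only from those fields, never by cutting up the scientific name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonNode:
    """Hypothetical node: stores the NCBI scientific name verbatim."""
    taxid: int
    scientific_name: str            # exactly as given in <ScientificName>
    rank: str = "no rank"
    genus: Optional[str] = None     # set only when known independently
    species: Optional[str] = None   # set only when known independently

    def binomial(self) -> Optional[str]:
        """Return 'Genus species' if both parts were stored, else None."""
        if self.genus and self.species:
            return f"{self.genus} {self.species}"
        return None


# genus/species supplied explicitly (an assumption for this example).
with_parts = TaxonNode(224308, "Bacillus subtilis subsp. subtilis str. 168",
                       genus="Bacillus", species="subtilis")
# Nothing known beyond what the tag says.
name_only = TaxonNode(376686, "Flavobacterium johnsoniae UW101")

print(with_parts.scientific_name, "->", with_parts.binomial())
print(name_only.scientific_name, "->", name_only.binomial())
```

Because scientific_name is never rebuilt from parts, it always matches what NCBI supplied, whatever definition NCBI uses.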

You could go through and get the lineage from the XML/dump file data  
and try to sort the genus/species out, then paste it all back  
together (fingers crossed!), but I think it's more headache than it's  
worth to split these up, then hope that you can paste them back  
together again and always expect to get the same results.

Chris

> Perhaps the short-hand (and the classifier used in name()) shouldn't
> mention the word 'scientific' to avoid confusion? But a) what else
> would we call it?, and b) for all ranks above species it /is/ the
> scientific name.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign





