[Bioperl-l] Bio::*Taxonomy* changes
Chris Fields
cjfields at uiuc.edu
Mon Jul 17 21:36:12 EDT 2006
On Jul 17, 2006, at 4:33 PM, Sendu Bala wrote:
> Chris Fields wrote:
>> There was some interest in getting Bio::Species to delegate to
>> Bio::Taxonomy::Node, so having scientific_name() would help quite
>> a bit since the name used on the ORGANISM line is the scientific
>> name (well, is supposed to be; famous last words).
>
> Can you clarify exactly what you mean here? Preferably with an
> example? ORGANISM line of which file format?
> The reason I ask is that I still feel we need to do parsing of the
> names for species rank and lower:
Sorry, should have clarified; GenBank sequence format. Here's the link:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
The ORGANISM annotation line for a GenBank record contains the formal
scientific name for the organism along with the lineage. I believe
SwissProt/EMBL and several other RichSeq formats do the same. The
lineage that is also present is almost always abbreviated, so it's
not always possible to determine the formal rankings strictly from
the file with any real degree of reliability (hence the past problems
with Bio::Species).
>
> # The 'scientific name' for humans could be considered to be 'Homo
> sapiens'.
> # Taxid 9606 in the NCBI taxonomy database has rank 'species' and
> ScientificName 'Homo sapiens'.
> # For sanity, Bio::*Taxonomy* likes to interpret this ScientificName
> as 'sapiens' so that the genus is not held redundantly. It provides a
> binomial() method to give you 'Homo sapiens' again if you want it.
> # I plan on maintaining this; scientific_name() would give you the
> non-redundant sibling-unique name 'sapiens'. binomial() on a species
> rank and lower would give you 'Homo sapiens' (presumably grabbing the
> 'Homo' from the parent node with rank 'genus', or similar).
I think you should use scientific_name to designate the full formal
scientific name for an organism according to the way NCBI describes
it for that particular node (nothing more, except removing the <>
stuff you mentioned earlier) and as it would appear for the ORGANISM
line. Otherwise you'll run into serious species/subspecies/strain
headaches (see below). If you want the real genus/species (i.e. nothing
extra, like strains or subspecies), separate them out and store them
using a genus/species get/set if possible; binomial() will then give
back the two-name genus/species designation.
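To make the split concrete, here is a rough sketch (Python purely for illustration). The class and attribute names are hypothetical and are not the Bio::Taxonomy::Node interface; the only point is that scientific_name is kept verbatim while genus/species live in their own optional slots.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonNode:
    """Hypothetical node; NOT the actual Bio::Taxonomy::Node interface."""

    taxid: int
    scientific_name: str           # stored verbatim, as NCBI reports it
    rank: str = "no rank"
    genus: Optional[str] = None    # set only when known reliably
    species: Optional[str] = None  # specific epithet, e.g. 'sapiens'

    def binomial(self) -> Optional[str]:
        """Return 'Genus species', or None if either part is unknown."""
        if self.genus is None or self.species is None:
            return None
        return f"{self.genus} {self.species}"


human = TaxonNode(9606, "Homo sapiens", rank="species",
                  genus="Homo", species="sapiens")
strain = TaxonNode(224308, "Bacillus subtilis subsp. subtilis str. 168",
                   genus="Bacillus", species="subtilis")

print(human.binomial())        # Homo sapiens
print(strain.binomial())       # Bacillus subtilis
print(strain.scientific_name)  # Bacillus subtilis subsp. subtilis str. 168
```

Nothing here derives genus/species from scientific_name; they are set separately, so the full name never has to be reassembled.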
Here are a few examples (in XML, retrieved with EUtilities). The TaxIDs
were obtained with ELink from a list of protein GIs (~700 in total), so
they are the actual NCBI TaxIDs linked to the sequence records. If you
try breaking these apart into species, what happens to the
strain/subspecies stuff? Notice that many of these nodes, which come
directly from protein GIs, also have no rank.
...
<TaxId>376686</TaxId>
<ScientificName>Flavobacterium johnsoniae UW101</ScientificName>
<OtherNames>
<Synonym>Flavobacterium johnsoniae NBRC 14942</Synonym>
<Synonym>Flavobacterium johnsoniae IFO 14942</Synonym>
<Synonym>Flavobacterium johnsoniae IAM 14304</Synonym>
<Synonym>Flavobacterium johnsoniae MYX.1.1.1</Synonym>
<Synonym>Flavobacterium johnsoniae NCIB 11054</Synonym>
<Synonym>Flavobacterium johnsoniae DSM 2064</Synonym>
<Synonym>Flavobacterium johnsoniae LMG 1341</Synonym>
<Synonym>Flavobacterium johnsoniae ATCC 17061</Synonym>
<EquivalentName>Flavobacterium johnsoniae strain UW101</EquivalentName>
<EquivalentName>Flavobacterium johnsoniae str. UW101</EquivalentName>
</OtherNames>
<ParentTaxId>986</ParentTaxId>
<Rank>no rank</Rank>
<Division>Bacteria</Division>
...
<TaxId>370552</TaxId>
<ScientificName>Streptococcus pyogenes MGAS10270</ScientificName>
<OtherNames>
<EquivalentName>Streptococcus pyogenes strain MGAS10270</EquivalentName>
<EquivalentName>Streptococcus pyogenes str. MGAS10270</EquivalentName>
</OtherNames>
<ParentTaxId>301448</ParentTaxId>
<Rank>no rank</Rank>
<Division>Bacteria</Division>
...
<TaxId>224308</TaxId>
<ScientificName>Bacillus subtilis subsp. subtilis str. 168</ScientificName>
<OtherNames>
<Synonym>Bacillus subtilis subsp. subtilis 168</Synonym>
</OtherNames>
<ParentTaxId>135461</ParentTaxId>
<Rank>no rank</Rank>
<Division>Bacteria</Division>
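As a sketch of what "take the tagged values as they are" could look like, the snippet below reads one of the records above with Python's standard ElementTree and copies the fields without reinterpreting them. The enclosing <Taxon> element and the read_taxon() helper are assumptions made for this illustration; only the inner tags and values come from the records shown.

```python
import xml.etree.ElementTree as ET

# Fragment assembled from the fields shown above; the enclosing <Taxon>
# element is an assumption about the surrounding document structure.
TAXON_XML = """
<Taxon>
  <TaxId>370552</TaxId>
  <ScientificName>Streptococcus pyogenes MGAS10270</ScientificName>
  <OtherNames>
    <EquivalentName>Streptococcus pyogenes strain MGAS10270</EquivalentName>
    <EquivalentName>Streptococcus pyogenes str. MGAS10270</EquivalentName>
  </OtherNames>
  <ParentTaxId>301448</ParentTaxId>
  <Rank>no rank</Rank>
  <Division>Bacteria</Division>
</Taxon>
"""


def read_taxon(xml_text):
    """Copy the tagged values into a dict without reinterpreting them."""
    taxon = ET.fromstring(xml_text.strip())
    return {
        "taxid": int(taxon.findtext("TaxId")),
        "scientific_name": taxon.findtext("ScientificName"),
        "parent_taxid": int(taxon.findtext("ParentTaxId")),
        "rank": taxon.findtext("Rank"),
        "equivalent_names": [e.text for e in taxon.iter("EquivalentName")],
    }


node = read_taxon(TAXON_XML)
print(node["scientific_name"])  # Streptococcus pyogenes MGAS10270
print(node["rank"])             # no rank
```

With rank reported as 'no rank', there is nothing in the record itself that says which part of the name is the strain.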
> Good, bad or ugly? I would prefer it works like this and we agree to
> differ with NCBI on what the 'scientific name' of a species node
> should be. Bio::Species can still delegate to Bio::Taxonomy::Node by
> calling binomial() (which I propose will actually give the correct
> answer, even for bacteria and viruses).
This is where I would strongly disagree (though I agree that the way
NCBI uses 'scientific name' is a bit off).
We are using the NCBI tax database, and as such we are somewhat at
the mercy of the NCBI tax nomenclature, unfortunately.
If NCBI decides to change their official definition of the
scientific name to something that makes a bit more sense, the XML and
dump data will reflect that and we won't have many problems adapting,
since the scientific name will always conform to their definition.
But if we split the information up ad hoc then we are bound for
disaster; it's just way too much headache to worry about. We could
always point to the official NCBI definition as the one we adopt and
then assign the tagged information from the node directly to
scientific_name (no globbing together at all). Bio::Species could
delegate likewise for the ORGANISM line, so there are no piecemeal
attempts to get Humpty Dumpty to fit back together again.
You could go through the lineage in the XML/dump file data and try to
sort the genus/species out, then paste it all back together (fingers
crossed!), but I think it's more headache than it's worth: once the
names are split up, there's no guarantee that pasting them back
together always gives the same result.
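A quick way to see the round-trip problem, using the names from the records above. naive_split() and rejoin() are deliberately simplistic stand-ins invented for this illustration (first two words as genus and species); they are not what Bio::Species or Bio::Taxonomy::Node actually do.

```python
NAMES = [
    "Homo sapiens",
    "Flavobacterium johnsoniae UW101",
    "Streptococcus pyogenes MGAS10270",
    "Bacillus subtilis subsp. subtilis str. 168",
]


def naive_split(scientific_name):
    """Treat the first two words as genus and species; drop the rest."""
    genus, species = scientific_name.split()[:2]
    return genus, species


def rejoin(genus, species):
    """Paste the two parts back together as 'Genus species'."""
    return f"{genus} {species}"


for name in NAMES:
    rebuilt = rejoin(*naive_split(name))
    status = "ok" if rebuilt == name else "LOST"
    print(f"{status:4}  {name!r} -> {rebuilt!r}")
```

Only the plain binomial survives the split and rejoin; the strain and subspecies designations on the other three names are lost unless they are stored somewhere else as well.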
Chris
> Perhaps the short-hand (and the classifier used in name()) shouldn't
> mention the word 'scientific' to avoid confusion? But a) what else
> would we call it?, and b) for all ranks above species it /is/ the
> scientific name.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign