[Bioperl-l] Bio::Taxonomy confusion

Chris Fields cjfields at uiuc.edu
Thu May 11 17:16:19 UTC 2006


> I think you'll see it is different and mostly a limitation of the
> genbank format and the Bio::Species objects that you get from a
> genbank parse do represent the full capabilities of a Taxonomy::Node.

I definitely see the rationale for using a TaxID lookup (I think Hilmar said
so as well), especially for local databases.  I wonder, though, if there is
a way that RichSeqs like GenBank, when passed through SeqIO, can be
'short-circuited' using the sequence builder to just accept what's on the
SOURCE or ORGANISM line of a file as is, without forcing it into
Bio::Species/Bio::Taxonomy::Node.  Or maybe diminish the role of the
SOURCE/ORGANISM lines altogether to just simple Annotation objects and place
much greater emphasis on the TaxID itself, in effect decoupling the TaxID
(taxonomic information) from SOURCE/ORGANISM (annotation information).
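
To make the decoupling idea concrete, here is a rough sketch.  It is plain
Python rather than BioPerl, so no BioPerl call signatures are assumed;
SourceInfo and read_source_block are made-up names, and the FEATURES lines
in the sample record are filled in for illustration (the SOURCE/ORGANISM
text and TaxID 279808 come from the Staphylococcus example quoted below).
The only point is that the text lines are stored verbatim and the TaxID is
carried separately:

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceInfo:
    """Hypothetical container: annotation text kept as-is, TaxID kept apart."""
    source: str            # SOURCE line text, verbatim
    organism: str          # ORGANISM line text, verbatim
    lineage: str           # classification lines joined, not interpreted
    taxid: Optional[int]   # from /db_xref="taxon:N", if the record has one


def read_source_block(record: str) -> SourceInfo:
    source = organism = ""
    lineage_parts = []
    in_lineage = False
    for line in record.splitlines():
        if line.startswith("SOURCE"):
            source = line[len("SOURCE"):].strip()
            in_lineage = False
        elif line.lstrip().startswith("ORGANISM"):
            organism = line.strip()[len("ORGANISM"):].strip()
            in_lineage = True          # indented lines that follow are lineage
        elif in_lineage and line.startswith(" "):
            lineage_parts.append(line.strip())
        else:
            in_lineage = False
    match = re.search(r'/db_xref="taxon:(\d+)"', record)
    taxid = int(match.group(1)) if match else None
    return SourceInfo(source, organism, " ".join(lineage_parts), taxid)


# Sample record; the FEATURES part is illustrative, not copied from a file.
RECORD = """\
SOURCE      Staphylococcus haemolyticus JCSC1435
  ORGANISM  Staphylococcus haemolyticus JCSC1435
            Bacteria; Firmicutes; Bacillales; Staphylococcus.
FEATURES             Location/Qualifiers
     source          1..100
                     /organism="Staphylococcus haemolyticus JCSC1435"
                     /db_xref="taxon:279808"
"""

info = read_source_block(RECORD)
print(info.organism, info.taxid)
```

Nothing here tries to decide which word is the species or the strain; that
question is left to whatever is looked up with the TaxID.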

In other words, have GenBank/EMBL classification lines and organism lines
essentially stay like they are in the input file (use simple objects).
Then, if one were really intent on getting the full name, classification,
etc., or one wanted to store their sequences in bioperl-db, they would be
required to either have a local db of NCBI Taxonomy or remote access to a
similar database (NCBI or something else) so a lookup could be accomplished
using the TaxID.  If they use BioSQL, then require them to preload their
BioSQL database with NCBI's taxonomy, something Hilmar already strongly
suggests.
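
A sketch of the kind of TaxID lookup I mean, again in plain Python with
made-up names (NODES, lineage).  The table is hand-built from the Frankia
XML quoted further down, and the parent links for the intermediate ranks
are an assumption based on the order of the LineageEx entries; a real local
database would be loaded from NCBI's taxonomy dump (nodes.dmp and
names.dmp) instead:

```python
from typing import Dict, List, Optional, Tuple

# taxid -> (parent taxid, rank, scientific name)
# Parent links above the genus are assumed from the LineageEx ordering.
NODES: Dict[int, Tuple[Optional[int], str, str]] = {
    131567: (None, "no rank", "cellular organisms"),
    2: (131567, "superkingdom", "Bacteria"),
    201174: (2, "phylum", "Actinobacteria"),
    1760: (201174, "class", "Actinobacteria (class)"),
    85003: (1760, "subclass", "Actinobacteridae"),
    2037: (85003, "order", "Actinomycetales"),
    85013: (2037, "suborder", "Frankineae"),
    74712: (85013, "family", "Frankiaceae"),
    1854: (74712, "genus", "Frankia"),
    106370: (1854, "species", "Frankia sp. CcI3"),
}


def lineage(taxid: int,
            nodes: Dict[int, Tuple[Optional[int], str, str]] = NODES
            ) -> List[Tuple[str, str]]:
    """Walk parent links from taxid to the top; return (rank, name) root-first."""
    path: List[Tuple[str, str]] = []
    seen = set()
    current: Optional[int] = taxid
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in taxonomy at taxid {current}")
        seen.add(current)
        parent, rank, name = nodes[current]   # KeyError if the taxid is unknown
        path.append((rank, name))
        current = parent
    path.reverse()
    return path


for rank, name in lineage(106370):
    print(f"{rank:>12}  {name}")
```

Each node only knows its own rank, name and parent, so the depth of the
result can differ from one organism to the next without any fixed slots.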

If anyone isn't interested in the taxonomic information or doesn't want to
bother grabbing the database or setting up remote access, tough luck; just
grab the Bio::Annotation/Bio::Species object and use that.  As the saying
goes, "you can't be all things to all people."  At some point you have to
throw your arms in the air, do the best you can, but give up trying to
please everyone.

> I am happy for someone to overhaul things, but it all boils down to
> inferring which part of a list of names is the species versus sub-
> species versus strain when none of the members of the list are
> labeled.  These are some of the same problems we have for swissprot as
> well.  I just don't think we can do it right only from the genbank
> file data so I don't see a lot of point of expecting Bio::Species to
> provide more than a representation of what is in the file and just
> return that array.
> 
> 
> It has seemed like we need to special case things pretty heavily or
> do a lookup in the taxonomydb for something.
> 
> Can you guess what value is the strain versus sub-species?  What
> happens when there is a two part strain name (space separated) and a
> sub-species or variety designation?
> 
> SOURCE      Staphylococcus haemolyticus JCSC1435
>    ORGANISM  Staphylococcus haemolyticus JCSC1435
>              Bacteria; Firmicutes; Bacillales; Staphylococcus.
> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=279808
> strain is JCSC1435
> 
> versus
> SOURCE      Muntiacus muntjak vaginalis
>    ORGANISM  Muntiacus muntjak vaginalis
>              Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>              Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
>              Pecora; Cervidae; Muntiacinae; Muntiacus.
> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9887
> species is muntjak, sub-species vaginalis ?
> 
> versus
> SOURCE      Aspergillus nidulans FGSC A4
>    ORGANISM  Aspergillus nidulans FGSC A4
>              Eukaryota; Fungi; Ascomycota; Pezizomycotina; Eurotiomycetes;
>              Eurotiales; Trichocomaceae; Emericella.
> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=227321
> 
> Genus should be Aspergillus or Emericella ?
> 
> Strain and subspecies/variety in the same entry
> SOURCE      Cryptococcus neoformans var. grubii H99
>    ORGANISM  Cryptococcus neoformans var. grubii H99
>              Eukaryota; Fungi; Basidiomycota; Hymenomycetes;
>              Heterobasidiomycetes; Tremellomycetidae; Tremellales; Tremellaceae;
>              Filobasidiella.
> http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=235443

Definitely tricky!  This really points out the problem here.  It used to be
a problem for only a few cases but with so many bacterial and fungal genomes
that's changed.  
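
To show how little a purely positional parse can tell you, here is a small
sketch (plain Python; naive_split and NaiveName are made-up names, and this
is not how Bio::Species actually does it).  It takes the first word as
genus, the second as species, and lumps everything else together, using the
four ORGANISM lines quoted above:

```python
from typing import NamedTuple


class NaiveName(NamedTuple):
    genus: str
    species: str
    rest: str   # strain? sub-species? variety? the position alone cannot say


def naive_split(organism: str) -> NaiveName:
    """Split an ORGANISM line by word position only (deliberately naive)."""
    words = organism.split()
    genus = words[0] if words else ""
    species = words[1] if len(words) > 1 else ""
    return NaiveName(genus, species, " ".join(words[2:]))


EXAMPLES = [
    "Staphylococcus haemolyticus JCSC1435",     # rest is a strain
    "Muntiacus muntjak vaginalis",              # rest is a sub-species
    "Aspergillus nidulans FGSC A4",             # rest is a two-word strain
    "Cryptococcus neoformans var. grubii H99",  # rest is variety plus strain
]

for name in EXAMPLES:
    print(naive_split(name))
```

The third slot holds a strain, a sub-species, a two-word strain, and a
variety plus strain in turn, and for the Aspergillus record the first word
does not even match the genus at the end of the lineage (Emericella).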

The Frankia XML example has the scientific name set to "Frankia sp. CcI3",
which matches the SOURCE/ORGANISM line in NCBI's GenBank files and the OS
line in EMBL files.  It looks like those lines are parsed and then rebuilt
from the ground up in Bio::SeqIO::genbank using Bio::Species objects, which,
in my case with the strain designation, is where the problem lies.  They
could instead be placed in annotation objects with (-tagname => 'SOURCE',
-value => 'Frankia sp. CcI3') or similar settings.  Or simplify Bio::Species
to only represent the information in the GenBank
SOURCE/ORGANISM/CLASSIFICATION or EMBL OS/OC lines and nothing more complex
than that (no complex taxonomy; for that you use the TaxID and a local
database).
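
For the taxonomy side, pulling rank-labeled names out of a record shaped
like the Frankia XML is not much work.  A minimal sketch in plain Python
(parse_taxon is a made-up name; SAMPLE is a trimmed copy of the quoted
record with a closing TaxaSet tag added and most LineageEx entries left
out; nothing here talks to NCBI):

```python
import xml.etree.ElementTree as ET

# Trimmed from the Frankia sp. CcI3 record quoted later in this message.
SAMPLE = """<TaxaSet>
<Taxon>
  <TaxId>106370</TaxId>
  <ScientificName>Frankia sp. CcI3</ScientificName>
  <ParentTaxId>1854</ParentTaxId>
  <Rank>species</Rank>
  <LineageEx>
    <Taxon>
      <TaxId>2</TaxId>
      <ScientificName>Bacteria</ScientificName>
      <Rank>superkingdom</Rank>
    </Taxon>
    <Taxon>
      <TaxId>74712</TaxId>
      <ScientificName>Frankiaceae</ScientificName>
      <Rank>family</Rank>
    </Taxon>
    <Taxon>
      <TaxId>1854</TaxId>
      <ScientificName>Frankia</ScientificName>
      <Rank>genus</Rank>
    </Taxon>
  </LineageEx>
</Taxon>
</TaxaSet>"""


def parse_taxon(xml_text: str) -> dict:
    """Return the first Taxon's own fields plus its lineage as (rank, name)."""
    root = ET.fromstring(xml_text)
    taxon = root.find("Taxon")
    if taxon is None:
        raise ValueError("no Taxon element found")
    lineage = [
        (node.findtext("Rank"), node.findtext("ScientificName"))
        for node in taxon.findall("LineageEx/Taxon")
    ]
    return {
        "taxid": int(taxon.findtext("TaxId")),
        "scientific_name": taxon.findtext("ScientificName"),
        "rank": taxon.findtext("Rank"),
        "parent_taxid": int(taxon.findtext("ParentTaxId")),
        "lineage": lineage,
    }


record = parse_taxon(SAMPLE)
print(record["scientific_name"], record["lineage"])
```

Every name arrives already labeled with its rank, so there is no guessing
about which word is the genus and which is the strain.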

Okay,  I need to lay off the coffee now...

Chris

> On May 11, 2006, at 10:57 AM, Chris Fields wrote:
> 
> > Heh...
> >
> > To tell the truth, I haven't looked at Bio::DB::Taxonomy in any depth
> > yet, but I myself have seen issues with the way Bio::Species treats
> > bacterial strains (I guess this also involves Bio::Taxonomy::Node since
> > that's what Bio::Species delegates to).  Seems it likes to repeat some
> > strain names when using $seq->species->common_name.  Not a killer
> > problem but annoying since the correct name is in the source tag in the
> > feature table!  I 'could' take a look at it but I can't guarantee quick
> > results.
> >
> > Jason, I could add Taxonomy to the EUtilities overhaul I mentioned to
> > you previously but it'll take a while to get going.  I'm really more
> > interested in getting epost-esearch-efetch sequence retrieval up and
> > running first with the same API as Bio::DB::GenBank/Genpept and
> > Bio::DB::Query::GenBank, then donating the code (late summer/fall???)
> > after working out namespace issues so it doesn't conflict with current
> > Bio::DB::WebDBSeqI inheritance.  I suppose I could also look at
> > Bio::DB::Taxonomy to see what's up in the next couple of weeks (after
> > conference), unless someone gets to it sooner.
> >
> > Chris
> >
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> >> bounces at lists.open-bio.org] On Behalf Of Jason Stajich
> >> Sent: Thursday, May 11, 2006 7:05 AM
> >> To: Chris Fields
> >> Cc: bioperl-l at lists.open-bio.org; 'Sendu Bala'
> >> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
> >>
> >> Great - now we just need someone to volunteer to actually work on
> >> this.
> >>
> >> The current code grabs most of this but I believe expects a different XML
> >>
> >>
> >> On May 10, 2006, at 11:36 PM, Chris Fields wrote:
> >>
> >>> I think you can get pretty much everything now, though I can
> >>> definitely see the use of a local database.  I ran a few tests,
> >>> really unrelated to this, using the powerscripting test page at NCBI
> >>> for eutils (for the curious, at
> >>> http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/eu.cgi) and was able
> >>> to retrieve XML-formatted taxonomic information; here's the bacterium
> >>> Frankia sp. CcI3 TaxID info, which looks like they have everything set
> >>> up by rank.  It gives quite a bit of information.
> >>>
> >>> <?xml version="1.0"?>
> >>> <!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN"
> >>> "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
> >>> <TaxaSet>
> >>>
> >>> <Taxon>
> >>>   <TaxId>106370</TaxId>
> >>>   <ScientificName>Frankia sp. CcI3</ScientificName>
> >>>   <ParentTaxId>1854</ParentTaxId>
> >>>   <Rank>species</Rank>
> >>>   <Division>Bacteria</Division>
> >>>   <GeneticCode>
> >>>     <GCId>11</GCId>
> >>>     <GCName>Bacterial and Plant Plastid</GCName>
> >>>   </GeneticCode>
> >>>   <MitoGeneticCode>
> >>>     <MGCId>0</MGCId>
> >>>     <MGCName>Unspecified</MGCName>
> >>>   </MitoGeneticCode>
> >>>   <Lineage>cellular organisms; Bacteria; Actinobacteria;
> >>> Actinobacteria
> >>> (class); Actinobacteridae; Actinomycetales; Frankineae; Frankiaceae;
> >>> Frankia</Lineage>
> >>>   <LineageEx>
> >>>     <Taxon>
> >>>       <TaxId>131567</TaxId>
> >>>       <ScientificName>cellular organisms</ScientificName>
> >>>       <Rank>no rank</Rank>
> >>>     </Taxon>
> >>>     <Taxon>
> >>>       <TaxId>2</TaxId>
> >>>       <ScientificName>Bacteria</ScientificName>
> >>>       <Rank>superkingdom</Rank>
> >>>     </Taxon>
> >>>     <Taxon>
> >>>       <TaxId>201174</TaxId>
> >>>       <ScientificName>Actinobacteria</ScientificName>
> >>>       <Rank>phylum</Rank>
> >>>     </Taxon>
> >>>     <Taxon>
> >>>       <TaxId>1760</TaxId>
> >>>       <ScientificName>Actinobacteria (class)</ScientificName>
> >>>       <Rank>class</Rank>
> >>>     </Taxon>
> >>>     <Taxon>
> >>>       <TaxId>85003</TaxId>
> >>>       <ScientificName>Actinobacteridae</ScientificName>
> >>>       <Rank>subclass</Rank>
> >>>     </Taxon>
> >>>     <Taxon>
> >>>       <TaxId>2037</TaxId>
> >>>       <ScientificName>Actinomycetales</ScientificName>
> >>>       <Rank>order</Rank>
> >>>     </Taxon>
> >>>     <Taxon>
> >>>       <TaxId>85013</TaxId>
> >>>       <ScientificName>Frankineae</ScientificName>
> >>>       <Rank>suborder</Rank>
> >>>     </Taxon>
> >>>     <Taxon>
> >>>       <TaxId>74712</TaxId>
> >>>       <ScientificName>Frankiaceae</ScientificName>
> >>>       <Rank>family</Rank>
> >>>     </Taxon>
> >>>     <Taxon>
> >>>       <TaxId>1854</TaxId>
> >>>       <ScientificName>Frankia</ScientificName>
> >>>       <Rank>genus</Rank>
> >>>     </Taxon>
> >>>   </LineageEx>
> >>>   <CreateDate>1999/10/22</CreateDate>
> >>>   <UpdateDate>2005/01/19</UpdateDate>
> >>>   <PubDate>2000/02/02</PubDate>
> >>> </Taxon>
> >>>
> >>>
> >>> Chris
> >>>
> >>>> -----Original Message-----
> >>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> >>>> bounces at lists.open-bio.org] On Behalf Of Jason Stajich
> >>>> Sent: Wednesday, May 10, 2006 7:54 PM
> >>>> To: Sendu Bala
> >>>> Cc: bioperl-l at lists.open-bio.org
> >>>> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
> >>>>
> >>>> I would use the implementation that talks to the flatfile db as the
> >>>> standard here.  Nodes are defined by the data in the taxonomy dump
> >>>> dbs from NCBI.  The eutils is pretty worthless except for taxid->name
> >>>> or the reverse; you can't get the full taxonomy (or couldn't when
> >>>> that implementation was written).
> >>>>
> >>>> The "name" method refers to the name of the node - each level in the
> >>>> taxonomy can have a "name".
> >>>>
> >>>> The bits of hackiness relate to wrapping the node object as a
> >>>> Bio::Species and/or being able to read a genbank file and the
> >>>> organism taxonomy data as a list and instantiating.  If we could rely
> >>>> on everything being in a DB of course this would be simpler.
> >>>>
> >>>> Another problem is the depth of the taxonomy is not constant for
> >>>> every node so assuming that a fixed number of slots will be filled in
> >>>> to generate the taxonomy leads to problems.
> >>>>
> >>>> Use the flatfile implementation (Bio::DB::Taxonomy::flatfile) as the
> >>>> best example of working code as this is how I really wanted it to
> >>>> work; the Bio::Species hacks are only there to shoehorn in data
> >>>> retrieved from genbank files.  With the flatfile implementation
> >>>> you have to walk all the way up the db hierarchy to get the kingdom
> >>>> for a node, so you do have to build up the classification hierarchy
> >>>> as each node only stores data about itself.
> >>>>
> >>>> I'm not exactly sure what you are proposing to do, but would
> >>>> definitely enjoy another pair of hands, I don't really have time to
> >>>> mess with it any time soon.
> >>>>
> >>>> -jason
> >>>> On May 10, 2006, at 5:30 AM, Sendu Bala wrote:
> >>>>
> >>>>> Hi,
> >>>>> I'm a little confused as to how names are supposed to work in
> >>>>> Bio::Taxonomy::Node.
> >>>>>
> >>>>> In the bioperl versions that I've looked at a Node doesn't seem to
> >>>>> store the most important information about itself - its scientific
> >>>>> name - in an obvious place. bioperl 1.5.1 puts it at the start of
> >>>>> the classification list. I'd have thought sticking it in -name
> >>>>> would make more sense, but this is used only for the GenBank common
> >>>>> name.
> >>>>>
> >>>>> The Bio::Taxonomy docs still suggest:
> >>>>>
> >>>>> my $node_species_sapiens = Bio::Taxonomy::Node->new(
> >>>>>    -object_id => 9606, # or -ncbi_taxid. Required tag
> >>>>>    -names => {
> >>>>>        'scientific' => ['sapiens'],
> >>>>>        'common_name' => ['human']
> >>>>>    },
> >>>>>    -rank => 'species'  # Required tag
> >>>>> );
> >>>>>
> >>>>> and whilst Bio::Taxonomy::Node does not accept -names, it does have
> >>>>> a 'name' method which claims to work like:
> >>>>>
> >>>>> $obj->name('scientific', 'sapiens');
> >>>>>
> >>>>> This kind of thing would be really nice, but afaics
> >>>>> Bio::Taxonomy::Node->new takes the -name value and makes a common
> >>>>> name out of it, whilst the name() method passes any 'scientific'
> >>>>> name to the scientific_name() method which is unable to set any
> >>>>> value (and warns about this), only get.
> >>>>>
> >>>>> It seems like the need to have this classification array work the
> >>>>> same way as Bio::Species is causing some unnecessary restrictions.
> >>>>> Can't the more sensible idea of having a dedicated storage spot for
> >>>>> the ScientificName and other parameters be used, with the
> >>>>> classification array either being generated just-in-time from the
> >>>>> hash-stored data, or indeed being generated from the Lineage field?
> >>>>>
> >>>>>
> >>>>> Also, why does a node store the complete hierarchy on itself in the
> >>>>> classification array? If we're going that far, why don't the
> >>>>> Bio::DB::Taxonomy modules like Bio::DB::Taxonomy::entrez just have a
> >>>>> get_taxonomy() method instead of a get_Taxonomy_Node() method?
> >>>>> get_taxonomy() could, from a single efetch.fcgi lookup, create a
> >>>>> complete Bio::Taxonomy with all the nodes. Whilst most nodes would
> >>>>> only have a minimum of information, if you could simply ask a node
> >>>>> what its rank and scientific name was you could easily build a
> >>>>> classification array, or ask what Kingdom your species was in etc.
> >>>>>
> >>>>> Are there good reasons for Taxonomy working the way it does in
> >>>>> 1.5.1, or would I not be wasting my time re-writing things to make
> >>>>> more sense (to me)?
> >>>>>
> >>>>>
> >>>>> Cheers,
> >>>>> Sendu.
> >>>>> _______________________________________________
> >>>>> Bioperl-l mailing list
> >>>>> Bioperl-l at lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>>
> >>>> --
> >>>> Jason Stajich
> >>>> Duke University
> >>>> http://www.duke.edu/~jes12
> >>>>
> >>>>
> >>>
> >>
> >> --
> >> Jason Stajich
> >> Duke University
> >> http://www.duke.edu/~jes12
> >>
> >>
> >
> 
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12




