[Bioperl-l] Bio::DB::Taxonomy:: mishandles species, subspecies/variant names

Chris Fields cjfields at uiuc.edu
Mon May 15 21:29:14 UTC 2006


> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Nadeem Faruque
> Sent: Monday, May 15, 2006 2:47 PM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::DB::Taxonomy:: mishandles species,
> subspecies/variant names
> 
> >> My personal view is that having it as an annotation would serve no real
> >> purpose. For me the whole point of any kind of species representation in
> >> bioperl is to allow you to compare species in a biologically meaningful
> >> way. If it's just some annotation then that means it's basically
> 
> I understand the need to find the species name of entries, especially
> now that so many complete genomes have been given their own strain-
> specific tax nodes, and I also think it is a shame that the ncbi tax
> dump does not give a rank to entries such as these (they cannot
> easily be distinguished from unofficial ranks higher in the tree
> without ascending the tree).
> Would it be useful for the species name to be included within EMBL
> file headers, eg in a line called OB (OB is a terrible suggestion
> based on 'Organism Binomial' since OS is already in use)?
> 
> eg two examples of the species 'Apple stem grooving virus', where the
> second one would appear to be a different species without delving
> into the tax tree or the inclusion of an OB line.
> 
> AC   D14995; S47260;
> DE   Apple stem grooving virus genome, complete sequence.
> OS   Apple stem grooving virus
> OB   Apple stem grooving virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flexiviridae;
> OC   Capillovirus.
> 
> AC   AY646511;
> DE   Citrus tatter leaf virus strain Kumquat 1, complete genome.
> OS   Citrus tatter leaf virus
> OB   Apple stem grooving virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flexiviridae;
> OC   Capillovirus.

Jason also mentions a few examples (see below).  The problem is that EMBL
and GenBank flatfiles list the lineage without saying which rank each name
holds, so the parser has to make a best guess.  What I'm seeing is that the
guess is wrong more often than not for complex scientific names (viruses,
bacteria, etc.).  Notice the doubled strain on the SOURCE line of the
following GenBank records after a pass through SeqIO (genbank->genbank
conversion, BTW; I haven't tried EMBL):

SOURCE      Azoarcus sp. EbN1 EbN1
  ORGANISM  Azoarcus sp.
            Bacteria; Proteobacteria; Betaproteobacteria; Rhodocyclales;
            Rhodocyclaceae; Azoarcus.

SOURCE      Mycobacterium sp. KMS KMS
  ORGANISM  Mycobacterium sp.
            Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
            Corynebacterineae; Mycobacteriaceae; Mycobacterium.

SOURCE      Mycobacterium tuberculosis C C
  ORGANISM  Mycobacterium tuberculosis
            Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
            Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium;
            tuberculosis complex; Mycobacterium.

SOURCE      Bacillus subtilis subsp. subtilis str. 168 subtilis str. 168
  ORGANISM  Bacillus subtilis subsp.
            Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
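
A quick way to flag this kind of doubling in converted output is to check
whether the last few words of the SOURCE value repeat back to back.  The
Python below is only a sketch for spotting the symptom; `doubled_suffix` is
a made-up helper, not part of BioPerl or SeqIO, and it knows nothing about
taxonomy.

```python
from typing import Optional


def doubled_suffix(source_value: str) -> Optional[str]:
    """Return the trailing run of words that appears twice in a row, or None.

    `source_value` is the text after the SOURCE keyword, for example
    "Azoarcus sp. EbN1 EbN1".
    """
    words = source_value.split()
    # Try the longest candidate first so that a multi-word strain such as
    # "subtilis str. 168" is reported whole rather than just "168".
    for k in range(len(words) // 2, 0, -1):
        if words[-k:] == words[-2 * k:-k]:
            return " ".join(words[-k:])
    return None


if __name__ == "__main__":
    for value in (
        "Azoarcus sp. EbN1 EbN1",
        "Mycobacterium tuberculosis C C",
        "Bacillus subtilis subsp. subtilis str. 168 subtilis str. 168",
        "Staphylococcus haemolyticus JCSC1435",
    ):
        print(repr(value), "->", doubled_suffix(value))
```

A check like this only catches exact repeats, so it is more useful as a
regression test on round-tripped files than as a fix.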

Here are Jason's examples, for posterity:

Can you guess which value is the strain and which is the sub-species?  What
happens when there is a two-part strain name (space separated) and a
sub-species or variety designation?

SOURCE      Staphylococcus haemolyticus JCSC1435
   ORGANISM  Staphylococcus haemolyticus JCSC1435
             Bacteria; Firmicutes; Bacillales; Staphylococcus.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=279808
strain is JCSC1435

versus
SOURCE      Muntiacus muntjak vaginalis
   ORGANISM  Muntiacus muntjak vaginalis
             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
             Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
             Pecora; Cervidae; Muntiacinae; Muntiacus.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9887
species is muntjak, sub-species vaginalis ?

versus
SOURCE      Aspergillus nidulans FGSC A4
   ORGANISM  Aspergillus nidulans FGSC A4
             Eukaryota; Fungi; Ascomycota; Pezizomycotina; Eurotiomycetes;
             Eurotiales; Trichocomaceae; Emericella.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=227321
Genus should be Aspergillus or Emericella ?

Strain and subspecies/variety in the same entry
SOURCE      Cryptococcus neoformans var. grubii H99
   ORGANISM  Cryptococcus neoformans var. grubii H99
             Eukaryota; Fungi; Basidiomycota; Hymenomycetes;
             Heterobasidiomycetes; Tremellomycetidae; Tremellales; Tremellaceae;
             Filobasidiella.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=235443


> > My point is, a large number of users do NOT use, nor care about, taxonomic
> > information to the degree they need to know the entire classification of
> > the organism; many are just as happy about getting the scientific name
> > only, which is in the GenBank/EMBL file itself.  To take one extreme, it
> > is not productive to force every user to download the NCBI tax database
> > and use lookups just to convert sequences from EMBL format to GenBank
> > format.  It's not productive to allow users to spam the NCBI tax database
> > remotely either, so hardcoding lookups is, IMHO, a big mistake.
> 
> I don't think you need to add any information to turn an embl-format
> file into a Genbank flatfile, but maybe I'm missing something obvious.

The issue is the way the SOURCE and ORGANISM lines are handled (OS/OC lines
in EMBL, I believe), which is by building a Bio::Species object.  The
problem is, like I mentioned above, that the flat file carries no
hierarchical ranking, just the order of the names.  We can try to make a
best guess based on that, but it's obviously very tricky, particularly when
dealing with subspecies, strains, etc.
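
To make the failure concrete, here is a minimal Python sketch of a purely
positional guess run over some of the names above.  It is an illustration
only, not what Bio::Species actually does; `naive_guess` and its field
names are invented for the example.

```python
def naive_guess(organism: str) -> dict:
    """Guess by position only: word 1 is the genus, word 2 the species,
    and everything after that is lumped together as 'rest'."""
    words = organism.split()
    return {
        "genus": words[0] if words else None,
        "species": words[1] if len(words) > 1 else None,
        "rest": " ".join(words[2:]) or None,
    }


if __name__ == "__main__":
    for name in (
        "Staphylococcus haemolyticus JCSC1435",     # rest is a strain
        "Muntiacus muntjak vaginalis",              # rest is a sub-species
        "Azoarcus sp. EbN1",                        # "sp." is not a species
        "Cryptococcus neoformans var. grubii H99",  # variety plus strain
    ):
        print(name, "->", naive_guess(name))
```

Nothing in the string tells the guesser whether "JCSC1435" and "vaginalis"
are the same kind of thing, which is exactly the information the flat file
leaves out.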

NCBI also states that the classification is often too long for the file and
so may be incomplete (I think they leave out nodes tagged 'no rank', but I
can't be completely sure), so that's another issue.

Anyway, this is where the lookup would come in.  It would require a local
taxonomy database (we can't spam the NCBI remote database, that would just
be rude) and, if it worked properly, would give the complete taxonomic
classification.

So now we have three possible situations:

1) One extreme: we require a lookup to get it right (which, BTW, it
currently doesn't); this by default requires a local database.
2) Middle of the road: we try to guess as best we can from the information
given (the current situation); this is breaking more and more often now, so
it is becoming less reliable.
3) Other extreme: we punt, absolve ourselves of even trying to parse the
data, and just have a strict tagname->value or similar simple construct to
handle the data.

#3 as the default, with #1 as an option, is probably best (least error-prone,
with the option of getting the most information), plus caching to speed up
lookups as Sendu Bala does now.
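
Roughly what I have in mind, as a Python sketch rather than real BioPerl
code: keep the flat-file strings verbatim by default, and only consult a
lookup if the user supplies one.  Everything here is hypothetical
(`OrganismInfo`, `local_lookup`, and the `LOCAL_TAXONOMY` table, which
stands in for a local copy of the tax dump and holds a placeholder value,
not real data).

```python
from functools import lru_cache
from typing import Callable, Optional, Tuple

Lineage = Tuple[str, ...]
Lookup = Callable[[str], Optional[Lineage]]


class OrganismInfo:
    """Option 3: store SOURCE/ORGANISM/lineage exactly as read, unparsed."""

    def __init__(self, source: str, organism: str, lineage: Lineage,
                 lookup: Optional[Lookup] = None):
        self.source = source
        self.organism = organism
        self.lineage = lineage
        self._lookup = lookup  # option 1, only if the user asks for it

    def classification(self) -> Lineage:
        """Return the looked-up lineage if available, else the verbatim one."""
        if self._lookup is not None:
            full = self._lookup(self.organism)
            if full is not None:
                return full
        return self.lineage


# Placeholder stand-in for a local taxonomy database (not real data).
LOCAL_TAXONOMY = {
    "Staphylococcus haemolyticus JCSC1435":
        ("<complete lineage from the local database>",),
}

calls = []  # records each time the slow path is actually taken


@lru_cache(maxsize=None)
def local_lookup(organism: str) -> Optional[Lineage]:
    """Cached lookup: repeated names hit the table only once."""
    calls.append(organism)
    return LOCAL_TAXONOMY.get(organism)
```

Format conversion never touches the lookup, so EMBL->GenBank works with no
database installed, and anyone who needs the full classification pays for
it once per organism name.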

Chris

 
> Nadeem
> 
> 
> --
> Dr S.M. Nadeem N. Faruque
> 9 Barley Court
> Saffron Walden
> Essex  CB11 3HG
> 01799 500 120
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l



