[Bioperl-l] Bio::Species/Bio::Taxonomy changes
Chris Fields
cjfields at uiuc.edu
Sun Jul 23 16:53:32 EDT 2006
Sendu, Hilmar, et al,
I was looking through SeqIO::genbank and thought I would bring up a
couple of things to think about re: GenBank Taxonomy information.
This is how NCBI defines the names used for SOURCE and ORGANISM
according to the latest GenBank release notes:
SOURCE - Common name of the organism or the name most frequently used
in the literature. Mandatory keyword in all annotated entries/one or
more records/includes one subkeyword.
ORGANISM - Formal scientific name of the organism (first line)
and taxonomic classification levels (second and subsequent lines).
Mandatory subkeyword in all annotated entries/two or more records.
According to their sample file page
(http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), the SOURCE is this:
Free-format information including an abbreviated form of the organism
name, sometimes followed by a molecule type. (See section 3.4.10 of
the GenBank release notes for more info.)
The SOURCE can also include the organelle and also may include
additional information, such as an abbreviated name and a common name
in parentheses.
...
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...
Setting scientific_name() isn't a problem; according to the above
definition, it is the full name on the ORGANISM line. The lineage
(or classification() array) is also straightforward. The common_name(),
though, as used by Bio::SeqIO::genbank, is the entire SOURCE line
(not just the abbreviated name, but the name and everything else).
No additional parsing is performed on it. write_seq() also seems to
do the wrong thing when rebuilding the SOURCE line, since the method
writes the subspecies to the line.
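To make the ORGANISM handling described above concrete, here is a minimal sketch in Python (used purely for illustration; it is not BioPerl code). The function name parse_organism_block is hypothetical, and the sketch assumes the caller has already stripped the ORGANISM keyword and passes the remaining lines in file order.

```python
def parse_organism_block(lines):
    """Split ORGANISM lines into (scientific_name, classification).

    Assumption: `lines` holds the ORGANISM block with the keyword already
    removed; the first line is the formal scientific name and the rest is
    the semicolon-separated lineage, terminated by a period.
    """
    if not lines:
        raise ValueError("empty ORGANISM block")
    scientific_name = lines[0].strip()
    # Join the continuation lines, then drop the terminating period.
    lineage_text = " ".join(line.strip() for line in lines[1:]).rstrip(".")
    # Keep the lineage in file order, skipping any empty fragments.
    classification = [t.strip() for t in lineage_text.split(";") if t.strip()]
    return scientific_name, classification


if __name__ == "__main__":
    name, lineage = parse_organism_block([
        "Saccharomyces cerevisiae",
        "Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;",
        "Saccharomycetales; Saccharomycetaceae; Saccharomyces.",
    ])
    print(name)
    print(lineage)
```

The lineage is returned in the order it appears in the file; whether a real implementation would store it in that order is left open here.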
I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try
using Bio::Taxonomy::Node objects instead of Bio::Species, then get
the parsing for these lines corrected and simplified. Essentially,
the way NCBI describes it, the main name on the line is actually the
free-form abbreviated name, the name in parentheses is the common
name (optionally present), and the organelle precedes all of these if
present. I want to try getting common_name() to match the common
name found for taxonomy (baker's yeast) rather than have it be a
simple container, add an abbreviated_name() method for the name
container for the SOURCE line, and have the organelle() method
actually be used if an organelle is present (it doesn't seem to be
set at the moment in SeqIO::genbank).
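The SOURCE layout described above (optional organelle, then the free-form abbreviated name, then an optional common name in parentheses) could be split roughly as follows. This is a sketch in Python for illustration only, not the Bio::SeqIO::genbank implementation; parse_source_line and the ORGANELLES tuple are hypothetical, and the organelle list is an assumed, non-exhaustive sample.

```python
import re

# Assumption: a small, non-exhaustive sample of organelle prefixes.
ORGANELLES = ("mitochondrion", "chloroplast", "plastid", "kinetoplast")


def parse_source_line(text):
    """Split SOURCE text (keyword removed) into its three parts."""
    text = text.strip()
    organelle = None
    for prefix in ORGANELLES:
        # The organelle, if present, precedes everything else on the line.
        if text.lower().startswith(prefix + " "):
            organelle = prefix
            text = text[len(prefix):].strip()
            break
    common_name = None
    # A trailing "(...)" group is treated as the common name.
    match = re.match(r"^(.*?)\s*\(([^()]*)\)\s*$", text)
    if match:
        text, common_name = match.group(1), match.group(2)
    return {
        "organelle": organelle,
        "abbreviated_name": text,
        "common_name": common_name,
    }


if __name__ == "__main__":
    print(parse_source_line("Saccharomyces cerevisiae (baker's yeast)"))
    print(parse_source_line("mitochondrion Homo sapiens (human)"))
```

Because SOURCE is free-format, any such split is a heuristic; lines that do not follow this layout fall through with the whole text as the abbreviated name.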
Right now, I have NO idea how EMBL, DDBJ, or other formats deal with
organism info; I would think that the main three
(GenBank/EMBL-SwissProt/DDBJ) handle them similarly...(Famous Last Words)
I also propose (I'll probably get yelled at here) NOT actively
supporting additional parsing of species, subspecies, etc directly
from a file w/o a DB lookup. As in, leave species, subspecies, genus
parsing from the flatfile as is (no longer support it) or remove it
completely and leave them unset.
I haven't looked, but I have a strong feeling that the species
parsing in Bio::SeqIO is different from format to format. It really
seems like more trouble than it's worth to maintain this, especially
as there is perfectly valid Taxonomy database information available
either locally using a flatfile or via Entrez. If people want to
have reliable $species->species or $species->genus for taxonomy
information, they will need to have the db_handle() set for the
Bio::Taxonomy::Node object and have a Node-based method to reset
species, genus, etc to the tax database information (maybe
reset_taxon or something along those lines).
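As a rough illustration of the reset idea, the sketch below uses Python and a toy in-memory lookup table in place of a real taxonomy database. Everything in it is hypothetical: the TaxonNode class, its reset_taxon method, and the TOY_TAXONOMY table are stand-ins, not the Bio::Taxonomy::Node API.

```python
# Assumption: a toy lookup table standing in for a local flatfile or Entrez.
TOY_TAXONOMY = {
    "Saccharomyces cerevisiae": {
        "genus": "Saccharomyces",
        "species": "Saccharomyces cerevisiae",
    },
}


class TaxonNode:
    """Hypothetical node whose ranks are only trusted after a DB lookup."""

    def __init__(self, scientific_name, db_handle=None):
        self.scientific_name = scientific_name
        self.db_handle = db_handle
        # Left unset on purpose: nothing is guessed from the flatfile.
        self.genus = None
        self.species = None

    def reset_taxon(self):
        """Fill genus/species from the database; return True on success."""
        if self.db_handle is None:
            return False
        record = self.db_handle.get(self.scientific_name)
        if record is None:
            return False
        self.genus = record["genus"]
        self.species = record["species"]
        return True


if __name__ == "__main__":
    node = TaxonNode("Saccharomyces cerevisiae", db_handle=TOY_TAXONOMY)
    print(node.reset_taxon(), node.genus, node.species)
```

Without a database handle the ranks simply stay unset, which matches the proposal to stop guessing them from the flatfile.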
Okay, rambled on enough. Any thoughts?
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign