[Bioperl-l] Bio::Species/Bio::Taxonomy changes

Sendu Bala bix at sendu.me.uk
Mon Jul 24 03:58:57 EDT 2006


Chris Fields wrote:
> Sendu, Hilmar, et al,
> 
> I was looking through SeqIO::genbank and thought I would bring up a
> couple of things to think about re: GenBank Taxonomy information.
[...]
> SOURCE	- Common name of the organism or the name most frequently used
>  in the literature. Mandatory keyword in all annotated entries/one or
>  more records/includes one subkeyword.
[...]
> Free-format information including an abbreviated form of the organism
>  name, sometimes followed by a molecule type. (See section 3.4.10 of
>  the GenBank release notes for more info.)
> 
> The SOURCE can also include the organelle and also may include 
> additional information, such as an abbreviated name and a common name
>  in parentheses.

More specifically:

(from section 3.4.10 of ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
The SOURCE field consists of two parts. The first part is found after
the SOURCE keyword and contains free-format information including an
abbreviated form of the organism name followed by a molecule type;
multiple lines are allowed, but the last line must end with a period.
The second part consists of information found after the ORGANISM
subkeyword. The formal scientific name for the source organism (genus
and species, where appropriate) is found on the same line as ORGANISM.
The records following the ORGANISM line list the taxonomic
classification levels, separated by semicolons and ending with a
period.
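
To make the three parts concrete, here is a minimal sketch in Python (not
BioPerl code, and not how Bio::SeqIO::genbank does it). The function name
parse_source_block and the example record are my own illustration. It
assumes the scientific name fits on the ORGANISM line itself, which is a
simplification.

```python
def parse_source_block(lines):
    """Split a GenBank SOURCE/ORGANISM block into its three parts.

    Hypothetical helper for illustration only. Assumes the scientific
    name does not wrap past the ORGANISM line.
    """
    source_parts = []    # free-format text after the SOURCE keyword
    organism = None      # scientific name found on the ORGANISM line
    lineage_parts = []   # classification lines following ORGANISM

    for line in lines:
        stripped = line.strip()
        if line.startswith("SOURCE"):
            source_parts.append(line[len("SOURCE"):].strip())
        elif stripped.startswith("ORGANISM"):
            organism = stripped[len("ORGANISM"):].strip()
        elif organism is None:
            # continuation of the free-format SOURCE text
            source_parts.append(stripped)
        else:
            lineage_parts.append(stripped)

    # Levels are separated by semicolons and the list ends with a period.
    lineage_text = " ".join(lineage_parts).rstrip(".")
    lineage = [level.strip() for level in lineage_text.split(";") if level.strip()]
    return " ".join(source_parts), organism, lineage


# Made-up example record in the layout the release notes describe.
example = [
    "SOURCE      Saccharomyces cerevisiae (baker's yeast)",
    "  ORGANISM  Saccharomyces cerevisiae",
    "            Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;",
    "            Saccharomycetes; Saccharomycetales; Saccharomycetaceae;",
    "            Saccharomyces.",
]
free_text, scientific_name, lineage = parse_source_block(example)
```

Note that only the ORGANISM line and the classification are split up; the
text after SOURCE is returned untouched.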


> The common_name(), though, as used by Bio::SeqIO::genbank, is the
> entire SOURCE line (not just the abbreviated name, but the name and
> everything else). No additional parsing is performed on it.
> write_seq() also seems to do the wrong thing when rebuilding the
> SOURCE line as well, as the method writes the subspecies to the line.
> 
> I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try
> using Bio::Taxonomy::Node objects instead of Bio::Species, then get
> the parsing for these lines corrected and simplified.  Essentially,
> the way NCBI describes it, the main name on the line is actually the
> free-form abbreviated name, the name in parentheses is the common 
> name (optionally present), and the organelle precedes all of these if
> present.  I want to try getting common_name() to match the common 
> name found for taxonomy (baker's yeast) rather than have it be a 
> simple container, add an abbreviated_name() method for the name 
> container for the SOURCE line, and have the organelle() method 
> actually be used if an organelle is present (it doesn't seem to be 
> set at the moment in SeqIO::genbank).

This is not how I read the specification. Everything on the same line as
'SOURCE' is free-format text and therefore cannot be parsed. It must be
stored as-is so that it can be written back out, but it serves no other
useful purpose. The file also provides the scientific name, which can be
used to do an accurate database lookup, which in turn gives you access
to the common names, like "baker's yeast".
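
As a rough sketch of that approach (again in Python rather than BioPerl,
with every name below being hypothetical): the SOURCE text is kept
verbatim for output, and the scientific name is the only key used for
lookups. TOY_TAXONOMY is a stand-in dictionary, not a real taxonomy
database interface.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a taxonomy database, keyed by scientific name.
TOY_TAXONOMY = {
    "Saccharomyces cerevisiae": ["baker's yeast"],
}


@dataclass
class SourceInfo:
    raw_source: str        # text after SOURCE, stored as-is, never parsed
    scientific_name: str   # text after ORGANISM, used as the lookup key

    def common_names(self, db=TOY_TAXONOMY):
        """Look up common names by scientific name, not from raw_source."""
        return db.get(self.scientific_name, [])

    def source_line(self):
        """Rebuild the SOURCE line from the stored text, unchanged."""
        # The keyword is padded to 12 columns, as in GenBank flat files.
        return "SOURCE".ljust(12) + self.raw_source


info = SourceInfo(
    raw_source="mitochondrion Saccharomyces cerevisiae (baker's yeast)",
    scientific_name="Saccharomyces cerevisiae",
)
```

Whatever the free text contains (organelle, abbreviation, parentheses), it
round-trips untouched, and the common name comes from the lookup.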

On a side note, why would we care about 'organelle' when we're dealing 
with taxonomy? Why does the NCBI taxonomy db have a slot for organelle?


