[Bioperl-l] Bio::Taxonomy changes

Mon Jul 24 18:40:03 UTC 2006

I have to do a little catching up on things here; lots of conversation this
morning!  

According to NCBI, the SOURCE line can hold organelle data, an abbreviated
version of the scientific name, and the GenBank common name in parentheses.
No other information is present.

The ORGANISM lines contain the scientific name (NCBI definition) and the
lineage, generally only ranked nodes but not always.  I believe it was Nadeem
Faruque who indicated that NCBI has some way of marking the ranks which
determines whether or not they appear in the lineage. 
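
To make the SOURCE/ORGANISM layout described above concrete, here is a minimal sketch in plain Perl (no BioPerl calls).  The sub names parse_source_line() and parse_organism_lines() and the short organelle list are assumptions made up for illustration, not anything in Bio::SeqIO::genbank, and the sketch assumes the scientific name fits on the first ORGANISM line.

```perl
use strict;
use warnings;

# Hypothetical list of organelle prefixes; illustrative only, not exhaustive.
my @ORGANELLES = qw(mitochondrion chloroplast plastid kinetoplast);

# Split a SOURCE value into organelle, name, and parenthesized common name.
sub parse_source_line {
    my ($source) = @_;
    my %parsed = (organelle => undef, name => undef, common_name => undef);

    my $organelle_re = join '|', map { quotemeta($_) } @ORGANELLES;
    if ($source =~ s/^($organelle_re)\s+//) {
        $parsed{organelle} = $1;
    }
    # A trailing "(...)" is taken to be the GenBank common name.
    if ($source =~ s/\s*\(([^()]+)\)\s*$//) {
        $parsed{common_name} = $1;
    }
    $parsed{name} = $source;    # whatever is left is the (abbreviated) name
    return \%parsed;
}

# Split ORGANISM lines into the scientific name (first line) and the lineage.
# Assumes the caller has already trimmed leading whitespace from each line.
sub parse_organism_lines {
    my (@lines) = @_;
    my $scientific_name = shift @lines;
    my $lineage = join ' ', @lines;
    $lineage =~ s/\.\s*$//;     # drop the terminating period
    my @classification = grep { length } split /\s*;\s*/, $lineage;
    return ($scientific_name, \@classification);
}

my $src = parse_source_line('mitochondrion Homo sapiens (human)');
my ($name, $lineage) = parse_organism_lines(
    'Homo sapiens',
    'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;',
    'Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;',
    'Catarrhini; Hominidae; Homo.',
);
print "$src->{organelle} | $src->{name} | $src->{common_name}\n";
print "$name: ", scalar(@$lineage), " lineage nodes\n";
```

Note that nothing here tries to guess genus or species from the lineage; it only separates the pieces that are actually on those lines.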

Here's what Bio::SeqIO::genbank does to get data into and out of GenBank
files:
------------------------------------------------------
Bio::SeqIO::genbank in methods next_seq() and _read_GenBank_Species():

1) Bio::Species acts as a container object

2) The SOURCE data is dumped entirely into common_name() (ughhhh).  There is
some additional work done as well before instantiating a Bio::Species; if
it is considered an unknown organism there is no Bio::Species object
returned.  We should get rid of that bit; every GenBank SOURCE has a TaxID
and therefore has a node, including plasmids and unknowns.  There will be no
genus/species or anything else set for that group.

3) The ORGANISM name is divided up into genus(), species(), and
subspecies(), based on the classification array (again, ughhh).  

4) The lineage string is split into an array and dumped into
classification()

5) No parsing of potential organelle information occurs.  None.  Zero.
Squat.

6) TaxID is grabbed from the 'source' seqfeature and assigned via
ncbi_taxid().  We could use this to also grab the organelle, etc.

------------------------------------------------------

Bio::SeqIO::genbank in method write_seq():

1)  SOURCE line : use the common_name data for output, but tag on the
subspecies information (?!?!?!). 

2)  ORGANISM lines : the name is rebuilt from the organelle() (which should
be on the SOURCE line) and genus and species, which come from the
classification array (?!?!?!).  The classification array is rebuilt from
classification().

------------------------------------------------------

Much of this may be cruft from changes in the official GenBank format that
we neglected to update.  

However, I think there's WAY too much hand-wringing about trying to get
everything into genus() species() etc without anything more than the (very
scant) information in the flatfile, esp. when using the classification array
as a basis.  The only places where reliable tax information is present in
the flatfile are:

1)  SOURCE line (organelle, common name, abbreviated name)
2)  ORGANISM lines (scientific name, classification array)
3)  'source' seqfeature (strain/variant (!), organelle, TaxID, etc found
here).  

We should assign those accordingly; we could even use the 'source'
seqfeature to grab strain, organelle, etc. just like we now do for the
TaxID.
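
As a rough illustration of pulling those values from the 'source' feature, the sketch below works on a plain hash of qualifiers rather than a real feature object.  The hash layout (tag => array ref of values) and the sub name taxon_info_from_source() are assumptions for this example only; the "taxon:" prefix on db_xref is how GenBank records the TaxID.

```perl
use strict;
use warnings;

# Qualifiers of a 'source' feature as a plain hash of array refs
# (a stand-in for a real feature object; this layout is an assumption).
sub taxon_info_from_source {
    my ($quals) = @_;
    my %info;

    # The TaxID lives in a db_xref value of the form "taxon:NNNN".
    for my $xref (@{ $quals->{db_xref} || [] }) {
        if ($xref =~ /^taxon:(\d+)$/) {
            $info{ncbi_taxid} = $1;
            last;
        }
    }
    # Copy over only the qualifiers that are actually present.
    for my $tag (qw(organelle strain)) {
        $info{$tag} = $quals->{$tag}[0] if $quals->{$tag};
    }
    return \%info;
}

my $info = taxon_info_from_source({
    organism  => ['Homo sapiens'],
    organelle => ['mitochondrion'],
    db_xref   => ['taxon:9606'],
});
for my $key (sort keys %$info) {
    print "$key = $info->{$key}\n";
}
```

Qualifiers that are missing from the feature (strain, in this input) are simply left unset rather than guessed.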

Beyond that we're really just guessing the ranks and the genus-species
names.  That makes no sense, especially when the information is easily
available in Bio::Taxonomy using entrez/flatfile.  We could have
Bio::Taxonomy::Species act as a container for IO purposes, using ONLY the
methods in the 'reliable information' list above in Bio::SeqIO::genbank and
other SeqIO RichSeqs.
Then hold the additional data with warnings attached if a lookup hasn't been
run, or not set them at all.  Or, use Hilmar's suggestion and force the user
to use the db handle and ncbi_taxid() to grab a new
Bio::Taxonomy::Node/Species object (based on the rank) which has the correct
information.  

As for the other container get/sets: species(), genus() etc.

These methods should be present, but only for species or below (hence
Bio::Taxonomy::Species).  In a way Bio::Taxonomy::Species is not entirely
correct, as many times the sequence in the file is from an organism at
the genus level (unassigned species) or subspecies/strain levels, or is
unranked (environmental samples, for instance).  All of these seem to have
TaxIDs though.  Don't think it really matters...

We could convert Bio::Species into an abstract interface class
(Bio::SpeciesI), moving the implemented methods over to
Bio::Taxonomy::Species, and have Bio::Taxonomy::Species implement
Bio::Taxonomy::NodeI or Bio::TaxonomyI as well.  Bio::Taxonomy::Species
could be checked with

$obj->isa('Bio::TaxonomyI') && $obj->isa('Bio::SpeciesI')

Or, modifying Hilmar's suggestion:

            |-----Tax::Node
NodeI/TaxI -|
            |-----Tax::Species
                |
SpeciesI -------|

So Species doesn't 'contaminate' Node.
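
A minimal sketch of that layout in plain Perl, assuming nothing about the real modules: the Sketch::* package names are hypothetical stand-ins for NodeI/TaxI, SpeciesI, Tax::Node and Tax::Species, and the constructor arguments are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the proposed interfaces and classes.
package Sketch::NodeI;       # shared taxonomy-node interface
sub ncbi_taxid { die "abstract method" }

package Sketch::SpeciesI;    # species-level accessors only
sub genus   { die "abstract method" }
sub species { die "abstract method" }

package Sketch::Node;        # implements NodeI only
our @ISA = ('Sketch::NodeI');
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub ncbi_taxid { $_[0]{ncbi_taxid} }

package Sketch::Species;     # implements both interfaces, does not inherit from Node
our @ISA = ('Sketch::NodeI', 'Sketch::SpeciesI');
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub ncbi_taxid { $_[0]{ncbi_taxid} }
sub genus      { $_[0]{genus} }
sub species    { $_[0]{species} }

package main;

my $node    = Sketch::Node->new(ncbi_taxid => 9605);
my $species = Sketch::Species->new(
    ncbi_taxid => 9606, genus => 'Homo', species => 'sapiens',
);

# Both objects answer to NodeI; only the species answers to SpeciesI.
for my $obj ($node, $species) {
    my $is_node    = $obj->isa('Sketch::NodeI')    ? 'yes' : 'no';
    my $is_species = $obj->isa('Sketch::SpeciesI') ? 'yes' : 'no';
    print ref($obj), ": NodeI=$is_node SpeciesI=$is_species\n";
}
```

The point of the design is that genus()/species() never show up on a plain node, while anything that only needs a TaxID can treat both classes alike.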

This will allow you to proceed with doing what you want to
Bio::Taxonomy::Node.  Both Node and Species could be checked simultaneously,
though they need to be changed at some point to implement the same base
class, so you could check using:

if ($obj->isa('Bio::Taxonomy::NodeI')) {
    # handle Node and Species objects the same way
}

As for getting Bio::SeqIO::genbank to play well with Bio::Taxonomy::Species,
all I did was 'clone' the Bio::Taxonomy::Node module into
Bio::Taxonomy::Species, remove the warnings in species() and other methods
for the time being, and change the method call for classification() in
Bio::SeqIO::genbank to send an array instead of an array_ref.  Then I
modified the parsing to retain the scientific_name and abbreviated_name
(though the latter should go into common_names()).  It passed all but one
test, where common_name was called and returned the entire SOURCE line (not
correct!).  Pretty simple, really...

BTW, I checked EMBL format, and it is very similar to GenBank, with the
interesting addition of the OG line (for organelle).  

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Monday, July 24, 2006 8:53 AM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
> 
> Chris Fields wrote:
> > Bio::SeqIO::genbank works very happily with the current
> > Bio::Taxonomy::Node now; if we intend to remove most of the method we
> > need to have a similar DB-aware module to house the flatfile data (like
> > Bio::Species) yet be capable of working with Bio::Taxonomy (like
> Tax::Node).
> 
> Can you give code examples of what Bio::SeqIO::genbank is doing and what
> makes it 'happy'? What are the requirements? Would it be as happy
> working with a Bio::Taxonomy object?