[Bioperl-l] Bio::Taxonomy changes

Fri Jul 21 04:51:30 UTC 2006

> I didn't actually mean a stored file (but that would be possible with a
> tied hash or something: DB_File, just like flatfile), but an in-memory
> one for use during the course of program execution. Stored file would
> probably be dangerous because you wouldn't know if the data has become
> stale or not - and checking to see if it wasn't would defeat the point.

Okay, that wouldn't be a problem.  I currently use in-memory caches
to hold NCBI history information and ELink information for
EUtilities.  It would just be a matter of doing the same for
Bio::DB::Taxonomy.
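
A minimal sketch of the in-memory cache idea, not taken from EUtilities or
Bio::DB::Taxonomy: the TaxonCache package, its methods, and the %fake_remote
hash are all hypothetical stand-ins, and the fetch callback takes the place of
whatever remote lookup would really be used. The cache is an ordinary hash, so
it disappears when the program exits and cannot go stale on disk.

```perl
use strict;
use warnings;

# Hypothetical wrapper: nothing in this package is BioPerl API.
package TaxonCache;

sub new {
    my ($class, %args) = @_;
    my $self = {
        fetch  => $args{fetch},   # code ref: taxon id -> record (the slow lookup)
        cache  => {},             # in-memory only; gone when the program ends
        misses => 0,              # how many times the slow lookup was called
    };
    return bless $self, $class;
}

sub get {
    my ($self, $taxid) = @_;
    unless (exists $self->{cache}{$taxid}) {
        $self->{misses}++;
        $self->{cache}{$taxid} = $self->{fetch}->($taxid);
    }
    return $self->{cache}{$taxid};
}

sub misses { return $_[0]{misses} }

package main;

# Stand-in for a remote source such as Entrez (toy data, one record).
my %fake_remote = (
    9606 => { scientific_name => 'Homo sapiens', rank => 'species' },
);

my $cache = TaxonCache->new(fetch => sub { return $fake_remote{ $_[0] } });

# Three requests for the same id; only the first reaches the "remote" source.
$cache->get(9606) for 1 .. 3;
print "lookups that reached the source: ", $cache->misses, "\n";
```

Using exists rather than a truth test means a lookup that legitimately returns
undef is cached too and is not retried on every call.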

...

> entrez already parses through LineageEx to build the classification
> array. flatfile walks up all the parents to do the same. Having the
> information isn't the issue. We have the information. The methods
> genus() and species() need to work with the genbank fileformat, that is
> the problem.

Bio::Species was originally a simple object for holding taxonomic
information.  It was later used in an attempt to hold the basic
organism information (scientific name, common name, lineage, etc.)
contained in a RichSeq file such as GenBank, EMBL, or SwissProt.  The
problem is working out, from the data in the file alone, which term in
the lineage corresponds to which rank, and which part of the
organism's scientific name is the genus, the species, and so on.  For
many organisms that comes down to a best guess.  It does work, but not
equally well for all RichSeq formats, not for every organism, and
definitely not all the time.  So, yes, genus(), species(), binomial(),
and other methods are present, but one must realize that filling the
various get/sets by parsing the file alone is, with the obvious
exceptions, not the best way to populate the object.

Unless... you incorporate information available only outside the
file itself (i.e. NCBI Taxonomy information).  This is where
Bio::Taxonomy seems to come in, as it is not species-specific (it
can represent any rank) and is also DB-aware.  Though Bio::Species
was originally going to delegate all its data to Bio::Taxonomy::Node,
I think the purpose was to eventually replace Bio::Species.

So, my question is, why not use a Bio::Taxonomy::Node-like class
initially to contain the appropriate data for a GenBank file (just
for read/write purposes)?  This object, since it implements
Bio::Taxonomy::NodeI, is also DB-aware and thus, if set up with a
database, could also get/set the appropriate object data correctly
using the lineage data.  So, for instance, if I called

$species = $seq->species();

and wanted the classification, scientific_name(), common_name(), and
other information that is gleaned from the file, then there's no need
for a lookup.  Once you cross into the bounds of:

print $species->species();
print $species->genus();

then there's trouble, since we're working straight from the file  
(i.e. parsing is mainly correct, but still guesswork and sometimes  
wrong).  But what if you could do something like this:

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');

# normally not needed as this is set by default internally, but as a demo here...
$species->db_handle($db);

# reset the appropriate data (genus, species, etc) based on Entrez tax data
# (reset_data() doesn't exist yet, but should be easy to implement)
$species->reset_data();

print $species->species();
my $parent = $species->get_Parent_Node;

my @child = $species->get_Children_Nodes;

...and so on
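
To make the reset_data() idea concrete, here is a toy sketch that runs
without BioPerl. Everything in it is an assumption for illustration: the
ToyNode package, its fields, and the %toy_db hash (a stand-in for a taxonomy
database keyed by taxon id) are hypothetical, and a real implementation would
go through the db_handle's own lookup methods instead of a plain hash.

```perl
use strict;
use warnings;

# Hypothetical node class; not Bio::Taxonomy::Node.
package ToyNode;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub db_handle {
    my $self = shift;
    $self->{db} = shift if @_;
    return $self->{db};
}

sub genus   { return $_[0]{genus} }
sub species { return $_[0]{species} }

# Hypothetical reset_data(): walk up the parent links in the database and
# overwrite the rank-based fields that were guessed from the file.
sub reset_data {
    my ($self) = @_;
    my $db = $self->db_handle or return $self;   # no database: keep the guesses
    my $id = $self->{taxid};
    while (defined $id) {
        my $rec = $db->{$id} or last;
        if ($rec->{rank} eq 'genus' or $rec->{rank} eq 'species') {
            $self->{ $rec->{rank} } = $rec->{name};
        }
        $id = $rec->{parent};
    }
    return $self;
}

package main;

# Toy database: each record has a name, a rank, and a parent id.
# NCBI stores the full binomial at species rank, so that is kept as is.
my %toy_db = (
    562 => { name => 'Escherichia coli',   rank => 'species', parent => 561 },
    561 => { name => 'Escherichia',        rank => 'genus',   parent => 543 },
    543 => { name => 'Enterobacteriaceae', rank => 'family',  parent => undef },
);

# A node as it might look after a bad guess made from the file alone.
my $node = ToyNode->new(taxid => 562, genus => 'coli', species => 'K-12');

$node->db_handle(\%toy_db);
$node->reset_data;

print $node->genus,   "\n";
print $node->species, "\n";
```

Without a db_handle the method leaves the file-derived values untouched, which
matches the case above where no lookup is needed.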

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
