[Bioperl-l] Bio::Taxonomy changes

Tue Jul 25 15:36:40 UTC 2006

> On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote:
> 
> > [...]
> > ## the fully-manual way
> > my $species = new Bio::Species;
> > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces
> > cerevisiae',
> >                                     -rank => 'species', -object_id
> > => 1,
> >                                     -parent_id => 2);
> 
> If this is meant as an example for the use cases I enumerated, then
> you wouldn't have the parent_id from a Genbank file. However, you
> didn't have that before either, so no problem.
> 
> > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
> >                                   -object_id => 2, -parent_id => 3);
> > # (no assumption that 'Saccharomyces' is the genus, so rank()
> > undefined)
> 
> I think in a confident parse you want to assign 'genus' if there's
> little doubt, for example 'Saccharomyces cerevisiae'. Not sure
> whether there are weird viri whose names look innocuous but in
> reality the name doesn't follow binomial convention.
> 
> > my $n3 = [etc]
> > $species->add_node($node);
> > $species->add_node($n2);
> 
> I know why you are doing this, but seeing this people will hit a
> mental snag. You should listen to Chris' refusal to see the sense in
> this as an indication that many people down the road won't see the
> sense either.

Thanks for pointing that out.  I think there is only a small, fundamental
difference in our views here.  I'm trying to view this as an outsider would,
a biologist not familiar with the Bioperl class structure.  I understand
what Sendu's trying to accomplish but it's really confusing to someone not
familiar with what Bio::Species is.

Hilmar, you had pointed out several times that Bio::Species and
Bio::Taxonomy shouldn't directly intermingle.  

My original thought for genbank.pm _read_GenBank_Species() was this, copied
and pasted from my local genbank.pm.  It's sort of extreme, but it passes
tests just fine.

sub _read_GenBank_Species {
	my( $self,$buffer) = @_;
	$_ = $$buffer;
	my @organelles = qw(plastid chloroplast mitochondrion);
	my( $source_data, $common_name, @class, $ns_name, $organelle,
	   $source_flag, $sci_name, $abbr );
	while (defined($_) || defined($_ = $self->_readline())) {
		# de-HTMLify (links that may be encountered here don't
contain
		# escaped '>', so a simple-minded approach suffices)
		s/<[^>]+>//g;
		if ( /^SOURCE\s+(.*)/o ) {
			$source_data = $1;
			$source_data =~ s/\.$//; # remove trailing dot
			# does it have a GenBank common name in parentheses?
			$common_name = $source_data =~ m{\((.*)\)}xms;
			# organelle?  If we find additional odd ones, 
                  # add to @organelle
			$organelle = grep {	$_ =~ $source_data }
@organelles;
			$source_flag = 1; 
		} elsif ( /^\s{2}ORGANISM\s+(.*)/o ) {
			$sci_name = $1;
			$source_flag = 0;
		} elsif ($source_flag) { # no ORGANISM
			$common_name .= $source_data;
			$common_name =~ s/\n//g;
			$common_name =~ s/\s+/ /g;
			$source_flag = 0;
		} elsif ( /^\s+(.+)/o ) { # lineage information
			my $line = $1;
			# only split on ';' or '.' so that classification 
                  # that is 2 words will still get matched, use 
                  # map() to remove trailing/leading spaces
			push(@class, map { s/^\s+//; s/\s+$//; $_; } 
                    split /[;\.]+/, $line)
			  if ( $line =~ /(;|\.)/ );
		} else { # reach end of GenBank tax info
			last;
		}
		$_ = undef; # Empty $_ to trigger read of next line
	}
	$$buffer = $_;
	@class = reverse @class;
	my $make = Bio::Taxonomy::Node->new();
	$make->common_name( $common_name ) if $common_name;
	$make->scientific_name($sci_name) if $sci_name;
      # could use SimpleValue objs here instead
	$make->classification( @class ) if @class;
	$make->organelle($organelle) if $organelle; 
	return $make;
} 

# back in next_seq...grab the TaxID from 'source'
# seqfeature
# could check organelle() here as well

    # add taxon_id from source if available
    if($species && ($feat->primary_tag eq 'source') &&
        $feat->has_tag('db_xref') && (! $species->ncbi_taxid())) {
        foreach my $tagval ($feat->get_tag_values('db_xref')) {
            if(index($tagval,"taxon:") == 0) {
                $species->ncbi_taxid(substr($tagval,6));
                last;
            }
        }
    }

In other words, remove the extra parsing of genus() species() subspecies
etc.  All GenBank sequences have a node represented in NCBI's tax database
(I checked it out).  Even plasmids, unknowns, environmental samples.

Chris

> So instead, make the logical model in your design more obvious, which
> I think ultimately will help maintainability as well. For example:
> 
> my $taxonomy = Bio::Taxonomy->new();
> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
>                                      -rank => 'species', -object_id
> => 1,
>                                      -parent_id => 2);
> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
>                                    -object_id => 2, -parent_id => 3);
> $taxonomy->add_node($node);
> $taxonomy->add_node($n2);
> 
> my $species = Bio::Species->new(-lineage => $taxonomy);
> print $species->binomial();
> print $species->genus();
> # this may trigger a lookup if a taxonomy db handle has been set, e.g.:
> # $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez'));
> print $species->classification();
> 
> 
> > [etc]
> >
> > ## Using a factory without db access
> > # assume that Bio::Taxonomy::GenbankFactory implements
> > # some modified Bio::Taxonomy::FactoryI
> > my $factory = Bio::Taxonomy::GenbankFactory->new();
> > my $species = $factory->generate(-classification => ['Saccharomyces
> >               cerevisiae', 'Saccharomyces',
> > 'Saccharomycetaceae' ...]);
> > # the generate() method above just does the fully-manual way for you
> 
> Except the method name would be create_object(), the parameter would
> be a hash ref, and the return value would be a Bio::TaxonomyI
> compliant object:
> 
> my $taxonomy = $factory->create_object({-classification =>
> ['Saccharomyces
>                cerevisiae', 'Saccharomyces',
> 'Saccharomycetaceae' ...]});
> my $species = Bio::Species->new(-lineage => $taxonomy);
> 
> 
> >
> > ## Using a factory with db access
> > # assume that Bio::Taxonomy::EntrezFactory implements some
> > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
> > # to get the nodes
> > my $factory = Bio::Taxonomy::EntrezFactory->new();
> 
> The logic where to do a lookup on should not be duplicated here. It
> only belongs under Bio::DB::Taxonomy::*.
> 
> > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces
> >                                                     cerevisiae');
> 
> Likewise, use the methods defined in Bio::DB::Taxonomy, and again,
> the return type is Bio::Taxonomy, which you would pass to
> Bio::Species->new().
> 
> 	-hilmar
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
> 
> 
> 
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

[Bioperl-l] Bio::*Taxonomy* changes

[Bioperl-l] Bio::Taxonomy changes