[Bioperl-l] Bio::*Taxonomy* changes
Chris Fields
cjfields at uiuc.edu
Tue Jul 25 11:36:40 EDT 2006
> On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote:
>
> > [...]
> > ## the fully-manual way
> > my $species = new Bio::Species;
> > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces
> > cerevisiae',
> > -rank => 'species', -object_id
> > => 1,
> > -parent_id => 2);
>
> If this is meant as an example for the use cases I enumerated, then
> you wouldn't have the parent_id from a Genbank file. However, you
> didn't have that before either, so no problem.
>
> > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
> > -object_id => 2, -parent_id => 3);
> > # (no assumption that 'Saccharomyces' is the genus, so rank()
> > undefined)
>
> I think in a confident parse you want to assign 'genus' if there's
> little doubt, for example 'Saccharomyces cerevisiae'. Not sure
> whether there are weird viri whose names look innocuous but in
> reality the name doesn't follow binomial convention.
>
> > my $n3 = [etc]
> > $species->add_node($node);
> > $species->add_node($n2);
>
> I know why you are doing this, but seeing this people will hit a
> mental snag. You should listen to Chris' refusal to see the sense in
> this as an indication that many people down the road won't see the
> sense either.
Thanks for pointing that out. I think there is only a small, fundamental
difference in our views here. I'm trying to view this as an outsider would,
a biologist not familiar with the Bioperl class structure. I understand
what Sendu's trying to accomplish but it's really confusing to someone not
familiar with what Bio::Species is.
Hilmar, you had pointed out several times that Bio::Species and
Bio::Taxonomy shouldn't directly intermingle.
My original thought for genbank.pm _read_GenBank_Species() was this, copied
and pasted from my local genbank.pm. It's sort of extreme, but it passes
tests just fine.
sub _read_GenBank_Species {
my( $self,$buffer) = @_;
$_ = $$buffer;
my @organelles = qw(plastid chloroplast mitochondrion);
my( $source_data, $common_name, @class, $ns_name, $organelle,
$source_flag, $sci_name, $abbr );
while (defined($_) || defined($_ = $self->_readline())) {
# de-HTMLify (links that may be encountered here don't
contain
# escaped '>', so a simple-minded approach suffices)
s/<[^>]+>//g;
if ( /^SOURCE\s+(.*)/o ) {
$source_data = $1;
$source_data =~ s/\.$//; # remove trailing dot
# does it have a GenBank common name in parentheses?
$common_name = $source_data =~ m{\((.*)\)}xms;
# organelle? If we find additional odd ones,
# add to @organelle
$organelle = grep { $_ =~ $source_data }
@organelles;
$source_flag = 1;
} elsif ( /^\s{2}ORGANISM\s+(.*)/o ) {
$sci_name = $1;
$source_flag = 0;
} elsif ($source_flag) { # no ORGANISM
$common_name .= $source_data;
$common_name =~ s/\n//g;
$common_name =~ s/\s+/ /g;
$source_flag = 0;
} elsif ( /^\s+(.+)/o ) { # lineage information
my $line = $1;
# only split on ';' or '.' so that classification
# that is 2 words will still get matched, use
# map() to remove trailing/leading spaces
push(@class, map { s/^\s+//; s/\s+$//; $_; }
split /[;\.]+/, $line)
if ( $line =~ /(;|\.)/ );
} else { # reach end of GenBank tax info
last;
}
$_ = undef; # Empty $_ to trigger read of next line
}
$$buffer = $_;
@class = reverse @class;
my $make = Bio::Taxonomy::Node->new();
$make->common_name( $common_name ) if $common_name;
$make->scientific_name($sci_name) if $sci_name;
# could use SimpleValue objs here instead
$make->classification( @class ) if @class;
$make->organelle($organelle) if $organelle;
return $make;
}
# back in next_seq...grab the TaxID from 'source'
# seqfeature
# could check organelle() here as well
# add taxon_id from source if available
if($species && ($feat->primary_tag eq 'source') &&
$feat->has_tag('db_xref') && (! $species->ncbi_taxid())) {
foreach my $tagval ($feat->get_tag_values('db_xref')) {
if(index($tagval,"taxon:") == 0) {
$species->ncbi_taxid(substr($tagval,6));
last;
}
}
}
In other words, remove the extra parsing of genus() species() subspecies
etc. All GenBank sequences have a node represented in NCBI's tax database
(I checked it out). Even plasmids, unknowns, environmental samples.
Chris
> So instead, make the logical model in your design more obvious, which
> I think ultimately will help maintainability as well. For example:
>
> my $taxonomy = Bio::Taxonomy->new();
> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
> -rank => 'species', -object_id
> => 1,
> -parent_id => 2);
> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
> -object_id => 2, -parent_id => 3);
> $taxonomy->add_node($node);
> $taxonomy->add_node($n2);
>
> my $species = Bio::Species->new(-lineage => $taxonomy);
> print $species->binomial();
> print $species->genus();
> # this may trigger a lookup if a taxonomy db handle has been set, e.g.:
> # $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez'));
> print $species->classification();
>
>
> > [etc]
> >
> > ## Using a factory without db access
> > # assume that Bio::Taxonomy::GenbankFactory implements
> > # some modified Bio::Taxonomy::FactoryI
> > my $factory = Bio::Taxonomy::GenbankFactory->new();
> > my $species = $factory->generate(-classification => ['Saccharomyces
> > cerevisiae', 'Saccharomyces',
> > 'Saccharomycetaceae' ...]);
> > # the generate() method above just does the fully-manual way for you
>
> Except the method name would be create_object(), the parameter would
> be a hash ref, and the return value would be a Bio::TaxonomyI
> compliant object:
>
> my $taxonomy = $factory->create_object({-classification =>
> ['Saccharomyces
> cerevisiae', 'Saccharomyces',
> 'Saccharomycetaceae' ...]});
> my $species = Bio::Species->new(-lineage => $taxonomy);
>
>
> >
> > ## Using a factory with db access
> > # assume that Bio::Taxonomy::EntrezFactory implements some
> > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
> > # to get the nodes
> > my $factory = Bio::Taxonomy::EntrezFactory->new();
>
> The logic where to do a lookup on should not be duplicated here. It
> only belongs under Bio::DB::Taxonomy::*.
>
> > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces
> > cerevisiae');
>
> Likewise, use the methods defined in Bio::DB::Taxonomy, and again,
> the return type is Bio::Taxonomy, which you would pass to
> Bio::Species->new().
>
> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list