[Bioperl-l] about common_name

Qiang Tu tuqiang@mail.shcnc.ac.cn
Fri, 22 Nov 2002 15:30:31 GMT


Hello all,

Sorry to bother you.

I found a problem with Bio::Species. If you load a sequence and want to read
the common name of its species, you would use
$seq->species->common_name. But many sequence records do not carry a correct
common name, so this method cannot return the right one. In the results
below it returns either the binomial itself or the binomial followed by the
common name in parentheses.
I think this could be solved by querying the taxonomy database at NCBI, and
I have written a prototype function that does so. Should we add such a
method to Bio::Species?
Thanks.

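I ran the script on some sequences. Here bname is the binomial, cname1 is
what $seq->species->common_name returns, and cname2 is the name looked up
at NCBI. The result is: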
==========
bname  is: Bos taurus
cname1 is: Bos taurus (cow)
cname2 is: cow

bname  is: Saccharomyces cerevisiae
cname1 is: Saccharomyces cerevisiae
cname2 is: baker's yeast

bname  is: Mus musculus
cname1 is: Mus musculus
cname2 is: house mouse

bname  is: Homo sapiens
cname1 is: Homo sapiens (human)
cname2 is: human

==========

and the script is:

==========
#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;
use LWP::Simple;

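my $file = shift or die "usage: $0 genbank_file\n";
my $io = Bio::SeqIO->new( '-file' => $file,
                          '-format' => 'genbank',
                         );
my $seq = $io->next_seq or die "no sequence found in $file\n";
my $bname  = $seq->species->binomial;
my $cname1 = $seq->species->common_name;    # name stored in the record
my $cname2 = ncbi_common_name($bname);      # name looked up at NCBI

# Either common name may be undefined; print an empty string in that case.
$cname1 = '' unless defined $cname1;
$cname2 = '' unless defined $cname2;

print "bname  is: $bname\n";
print "cname1 is: $cname1\n";
print "cname2 is: $cname2\n";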

sub ncbi_common_name {

    my $bname    = shift or return;

    my $utils    = "http://www.ncbi.nlm.nih.gov/entrez/eutils";
    my $esearch  = "$utils/esearch.fcgi?db=taxonomy&term=";
    my $esummary = "$utils/esummary.fcgi?db=taxonomy&id=";
    my $countid1 = '<eSearchResult>.*?<Count>';
    my $countid2 = '</Count>';
    my $id1      = '<Id>';
    my $id2      = '</Id>';
    my $cnameid1 = '<Item.*?CommonName.*?>';
    my $cnameid2 = '</Item>';

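    # Build a quoted phrase query, e.g. "Bos+taurus".
    $bname =~ s/\s+/+/g;
    $bname = '"'.$bname.'"';

    my $esearch_result = get($esearch . $bname) or return;

    # Give up unless the search matched exactly one taxon.
    my $count;
    if ($esearch_result =~ /$countid1(\d+)$countid2/s) {
        $count = $1;
    }
    return if (!defined $count or $count != 1);

    # Take the taxonomy id of that single hit.
    my $id;
    if ($esearch_result =~ /$id1(\d+)$id2/) {
        $id = $1;
    }
    return if (!$id);

    # Fetch the summary for that id and pull out the CommonName item.
    my $esummary_result = get($esummary . $id) or return;

    my $cname;
    if ($esummary_result =~ /$cnameid1(.*?)$cnameid2/) {
        $cname = $1;
    }

    return $cname;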
}


==========
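
The two parsing steps in ncbi_common_name can also be tried without
network access. Below is a minimal sketch, not a finished method: the
helper names parse_single_id and parse_common_name are hypothetical, and
the two XML snippets are made-up samples that only assume the
Count/Id/Item layout the regular expressions in the script above look
for. They were not captured from NCBI.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: return the taxonomy id from an esearch reply,
# or nothing unless the reply reports exactly one hit.
sub parse_single_id {
    my $xml = shift;
    return unless defined $xml;
    my ($count) = $xml =~ m{<eSearchResult>.*?<Count>(\d+)</Count>}s;
    return unless defined $count && $count == 1;
    my ($id) = $xml =~ m{<Id>(\d+)</Id>};
    return $id;
}

# Hypothetical helper: return the text of the CommonName item from an
# esummary reply, or undef if there is no such item.
sub parse_common_name {
    my $xml = shift;
    return unless defined $xml;
    my ($cname) = $xml =~ m{<Item[^>]*CommonName[^>]*>(.*?)</Item>};
    return $cname;
}

# Made-up sample replies, for illustration only.
my $esearch_xml = <<'XML';
<eSearchResult>
  <Count>1</Count>
  <IdList>
    <Id>9913</Id>
  </IdList>
</eSearchResult>
XML

my $esummary_xml = <<'XML';
<eSummaryResult>
  <DocSum>
    <Id>9913</Id>
    <Item Name="ScientificName" Type="String">Bos taurus</Item>
    <Item Name="CommonName" Type="String">cow</Item>
  </DocSum>
</eSummaryResult>
XML

my $id    = parse_single_id($esearch_xml);
my $cname = parse_common_name($esummary_xml);

print "id    is: $id\n";
print "cname is: $cname\n";
```

Keeping the parsing apart from the two get() calls would make it possible
to test a future Bio::Species method against saved replies instead of the
live server.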

 
 
Qiang Tu
Institute of Biochemistry and Cell Biology
Chinese Academy of Sciences
Email: tuqiang@mail.shcnc.ac.cn, tuqiang_cn@yahoo.com