[Bioperl-l] how to retrieve organism name from accession number?

Tue Jan 26 20:40:40 EST 2010

Grrrrrr, I hate eutils!!!!

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: NCBI esearch fatal error: Search Backend failed: Error 111 (Connection refused)
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:357
STACK: Bio::Tools::EUtilities::parse_data /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:332
STACK: Bio::Tools::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:441
STACK: Bio::DB::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/DB/EUtilities.pm:363
STACK: get_desc.pl:32
-----------------------------------------------------------

Nice error message though :-)
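
Since the exception is thrown from get_ids, one way to live with a transient "Connection refused" is to wrap the call and retry it. What follows is only a minimal sketch: with_retries is a hypothetical helper (not part of BioPerl), and the tries and delay options are made-up names with arbitrary defaults.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): call $code and retry if it dies.
# Call it in list context; it returns whatever list $code returns.
sub with_retries {
    my ($code, %opt) = @_;
    my $tries = defined $opt{tries} ? $opt{tries} : 3;
    my $delay = defined $opt{delay} ? $opt{delay} : 3;    # seconds, assumed default
    my $error;
    for my $attempt (1 .. $tries) {
        my @result;
        my $ok = eval { @result = $code->(); 1 };
        return @result if $ok;
        $error = $@ || 'unknown error';
        warn "attempt $attempt of $tries failed: $error\n";
        # wait a little longer after each failure, but not after the last one
        sleep $delay * $attempt if $delay && $attempt < $tries;
    }
    die "giving up after $tries attempts: $error\n";
}
```

With a factory like the one in the script further down the thread, the call at get_desc.pl:32 could then become something like my @ids = with_retries(sub { $factory->get_ids });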

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> Sent: Monday, 11 January 2010 10:05 a.m.
> To: 'Chris Fields'
> Cc: 'Bhakti Dwivedi'; 'Mark A. Jensen'; 'bioperl-l at lists.open-bio.org'
> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> number?
> 
> I've started to go off eUtils recently (not BioPerl's fault) as I've often
> been finding that with large queries, chunks of the resulting data are
> missing.
> For example, before Xmas I was creating species-specific databases by
> using eUtils to get a list of GI numbers back for a taxid, then retrieving
> the fasta sequences in chunks of 500.
> Very regularly, in the middle of the fasta there would be a
> "Resource Unavailable" message, e.g.
>   >test_sequence_1
>   TACGATCATCGCTResource UnavailableTACGACTCTGCT
>   >test_sequence_2
>   TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT
> 
> Often this wasn't detected until formatdb complained about invalid
> characters.
> Inquiries to NCBI as to why this was happening and what to do about it
> returned stupid answers ("do each sequence manually thru the web
> interface", or "use eUtils").
> As we have a nice fast network connection, I now prefer to download very
> large gzip files (i.e. all of refseq) and extract what I need.
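
The corrupted chunks described above can be caught before formatdb sees them by checking each downloaded chunk as it arrives. This is a rough sketch only: bad_fasta_records is a hypothetical helper, and it assumes nucleotide sequences, so anything outside the IUPAC nucleotide letters and '-' counts as damage (protein data would need a different character class).

```perl
use strict;
use warnings;

# Hypothetical helper: return the ids of FASTA records whose sequence lines
# contain characters that are not IUPAC nucleotide codes or '-'.
sub bad_fasta_records {
    my ($fasta) = @_;
    my (@bad, %seen);
    my $id = '(no header)';
    for my $line (split /\r?\n/, $fasta) {
        next unless length $line;
        if ($line =~ /^>(\S*)/) { $id = $1; next; }
        # text such as "Resource Unavailable" fails this test (space, 'o', 'e', ...)
        push @bad, $id if $line =~ /[^ACGTURYSWKMBDHVN-]/i && !$seen{$id}++;
    }
    return @bad;
}
```

A chunk that returns any ids here could simply be fetched again rather than appended to the database file.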
> 
> I can't help but think that NCBI could solve a lot of problems if they
> gzipped the output from eUtils queries - it's something I've requested
> regularly for the last 5 years or so!!
> 
> --Russell
> 
> 
> > -----Original Message-----
> > From: Chris Fields [mailto:cjfields at illinois.edu]
> > Sent: Monday, 11 January 2010 9:50 a.m.
> > To: Smithies, Russell
> > Cc: 'Mark A. Jensen'; 'Bhakti Dwivedi'; 'bioperl-l at lists.open-bio.org'
> > Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> > number?
> >
> > One could also use Bio::DB::Taxonomy, which indexes the same files or
> > (alternatively) makes the eutil calls (see Bio::DB::Taxonomy POD for the
> > details).
> >
> > chris
> >
> > On Jan 10, 2010, at 2:34 PM, Smithies, Russell wrote:
> >
> > > An alternate non-BioPerly way (that may be faster given NCBI's
> > > flakiness lately) would be to download the gi_taxid_nucl.zip or
> > > gi_taxid_prot.zip files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/,
> > > load them into a hash and do lookups.
> > > In that same dir, taxdump.tar.gz contains a file called names.dmp
> > > which lists taxids and descriptions (and synonyms).
> > >
> > > If it was me, I'd split gi_taxid_nucl and names.dmp into hashes so I
> > > could do this (the gi_taxid files are keyed by GI number, so an
> > > accession has to be mapped to its GI first):
> > >
> > >   my $taxid    = $gi_taxid_nucl{$gi};
> > >   my $org_name = $names{$taxid};
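
Building those two hashes could look roughly like the sketch below. The function names are hypothetical. The file layouts assumed are the usual NCBI dump formats: gi_taxid_nucl has two tab-separated columns (GI, taxid), and names.dmp separates fields with tab-pipe-tab and has a name class column, of which only 'scientific name' rows are kept. Both functions take an already-open filehandle, so unpacking the downloads is left out.

```perl
use strict;
use warnings;

# gi_taxid_nucl.dmp (assumed layout): GI <tab> taxid, one pair per line.
sub load_gi_taxid {
    my ($fh) = @_;
    my %taxid_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        $taxid_for{$gi} = $taxid if defined $taxid;
    }
    return \%taxid_for;
}

# names.dmp (assumed layout): taxid, name, unique name, name class,
# separated by "\t|\t" with a trailing "\t|" on every line.
sub load_names {
    my ($fh) = @_;
    my %name_for;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\t\|$//;    # drop the trailing field terminator
        my ($taxid, $name, undef, $class) = split /\t\|\t/, $line;
        # a taxid has many rows (synonyms, common names); keep one
        $name_for{$taxid} = $name
            if defined $class && $class eq 'scientific name';
    }
    return \%name_for;
}
```

The full gi_taxid file is large, so it may be worth keeping only the GIs of interest while reading rather than hashing every line.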
> > >
> > > --Russell
> > >
> > >
> > >> -----Original Message-----
> > >> From: bioperl-l-bounces at lists.open-bio.org
> > >> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Mark A. Jensen
> > >> Sent: Saturday, 26 December 2009 4:52 p.m.
> > >> To: Bhakti Dwivedi; bioperl-l at lists.open-bio.org
> > >> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> > >> number?
> > >>
> > >> Bhakti,
> > >> The following example (using EUtilities) may serve your purpose:
> > >>
> > >> use Bio::DB::EUtilities;
> > >>
> > >> my (%taxa, @taxa);
> > >> my (%names, %idmap);
> > >>
> > >> # these are protein ids; nuc ids will (probably) work by changing
> > >> # -dbfrom => 'nucleotide'
> > >>
> > >> my @ids = qw(1621261 89318838 68536103 20807972 730439);
> > >>
> > >> my $factory = Bio::DB::EUtilities->new(-eutil => 'elink',
> > >>                                       -db => 'taxonomy',
> > >>                                       -dbfrom => 'protein',
> > >>                                       -correspondence => 1,
> > >>                                       -id => \@ids);
> > >>
> > >> # iterate through the LinkSet objects
> > >> while (my $ds = $factory->next_LinkSet) {
> > >>    $taxa{($ds->get_submitted_ids)[0]} = ($ds->get_ids)[0]
> > >> }
> > >>
> > >> @taxa = grep { defined } @taxa{@ids};   # skip ids with no taxonomy link
> > >>
> > >> $factory = Bio::DB::EUtilities->new(-eutil => 'esummary',
> > >>        -db    => 'taxonomy',
> > >>        -id    => \@taxa );
> > >>
> > >> while (my $docsum = $factory->next_DocSum) {
> > >>    my ($taxid) = $docsum->get_contents_by_name('TaxId');
> > >>    my ($name)  = $docsum->get_contents_by_name('ScientificName');
> > >>    $names{$taxid} = $name;
> > >> }
> > >>
> > >> foreach (@ids) {
> > >>    $idmap{$_} = $names{$taxa{$_}};
> > >> }
> > >>
> > >> # %idmap is
> > >> #    1621261 => 'Mycobacterium tuberculosis H37Rv'
> > >> #    20807972 => 'Thermoanaerobacter tengcongensis MB4'
> > >> #    68536103 => 'Corynebacterium jeikeium K411'
> > >> #    730439 => 'Bacillus caldolyticus'
> > >> #    89318838 => undef    (this record has been removed from the db)
> > >>
> > >> 1;
> > >>
> > >> You probably will need to break up your 30000 into chunks
> > >> (say, 1000-3000 each), and do the above on each chunk with a
> > >>
> > >> sleep 3;
> > >>
> > >> or so separating the queries.
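
The chunk-and-sleep advice above might be sketched as follows. Both subs are hypothetical helpers, not BioPerl API: chunks splits the id list, and lookup_in_chunks calls a caller-supplied code ref on each chunk (for instance, the elink/esummary steps above wrapped in a sub that returns a hash ref of id => name) with a pause between queries.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list into array refs of at most $size items.
sub chunks {
    my ($size, @items) = @_;
    die "chunk size must be positive\n" unless $size > 0;
    my @chunks;
    push @chunks, [ splice @items, 0, $size ] while @items;
    return @chunks;
}

# Hypothetical driver: $lookup takes an array ref of ids and returns a hash ref.
# The per-chunk results are merged into one hash.
sub lookup_in_chunks {
    my ($lookup, $size, $pause, @ids) = @_;
    my %result;
    my @todo = chunks($size, @ids);
    while (my $chunk = shift @todo) {
        %result = (%result, %{ $lookup->($chunk) });
        sleep $pause if $pause && @todo;    # no need to wait after the last chunk
    }
    return \%result;
}
```

For the 30,000 ids in question this would be called with a chunk size of 1000 and a pause of 3, giving 30 queries.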
> > >> MAJ
> > >> ----- Original Message -----
> > >> From: "Bhakti Dwivedi" <bhakti.dwivedi at gmail.com>
> > >> To: <bioperl-l at lists.open-bio.org>
> > >> Sent: Friday, December 25, 2009 9:46 PM
> > >> Subject: [Bioperl-l] how to retrieve organism name from accession
> > >> number?
> > >>
> > >>
> > >>> Hi,
> > >>>
> > >>> Does anyone know how to retrieve the "Source" or the "Species name"
> > >>> given the accession number using Bioperl?  I have these 30,000
> > >>> accession numbers for which I need to get the source organisms.  Any
> > >>> kind of help will be appreciated.
> > >>>
> > >>> Thanks
> > >>>
> > >>> BD
> > >>> _______________________________________________
> > >>> Bioperl-l mailing list
> > >>> Bioperl-l at lists.open-bio.org
> > >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >>>
> > >>>
> > >>
> > >
> 
> 