[Bioperl-l] how to retrieve organism name from accession number?

Tue Jan 26 21:42:22 EST 2010

Makes me wonder if they're pushing more users towards the SOAP-based services and away from eutils.

chris

On Jan 26, 2010, at 7:59 PM, Smithies, Russell wrote:

> I've had a wide selection of errors lately:
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: NCBI esearch fatal error: Search Backend failed: Error 11 (Resource temporarily unavailable)
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:357
> STACK: Bio::Tools::EUtilities::parse_data /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:332
> STACK: Bio::Tools::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:441
> STACK: Bio::DB::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/DB/EUtilities.pm:363
> STACK: get_desc.pl:32
> -----------------------------------------------------------
> 
> And I never get a good explanation from NCBI or suggestions on how to avoid it.
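
Since errors like "Resource temporarily unavailable" and "Connection refused" are usually transient, one way to live with them is to wrap the failing call in a retry loop with an increasing pause. The following is only a minimal sketch: `with_retries` and its `max_tries` and `delay` options are hypothetical names, not part of BioPerl, and it assumes the failure surfaces as a thrown exception, as in the traces above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): run $code, retrying when it
# throws. Sleeps $delay seconds after a failure and doubles the delay each
# time; the last failure is rethrown unchanged.
sub with_retries {
    my ($code, %opt) = @_;
    my $max_tries = $opt{max_tries} || 3;
    my $delay     = defined $opt{delay} ? $opt{delay} : 5;

    for my $attempt (1 .. $max_tries) {
        my @result;
        my $ok = eval { @result = $code->(); 1 };
        return wantarray ? @result : $result[0] if $ok;

        my $err = $@;
        die $err if $attempt == $max_tries;    # out of attempts: rethrow
        warn "attempt $attempt failed, retrying in ${delay}s: $err";
        sleep $delay if $delay > 0;
        $delay *= 2;
    }
}
```

A call might look like `my @ids = with_retries(sub { $factory->get_ids }, max_tries => 5);`. It is probably safer to build the factory inside the sub, so that each attempt issues a fresh request rather than re-reading a failed response.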
> 
> 
> --Russell
> 	
> 
>> -----Original Message-----
>> From: Chris Fields [mailto:cjfields at illinois.edu]
>> Sent: Wednesday, 27 January 2010 2:46 p.m.
>> To: Smithies, Russell
>> Cc: 'Mark A. Jensen'; 'bioperl-l at lists.open-bio.org'
>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
>> number?
>> 
>> It's unfortunate but I have heard this problem popping up quite a bit more
>> frequently lately.  Not to push too many buttons but NCBI isn't very
>> forthcoming with help these days; they have become quite insular.  Not
>> sure if they're short-staffed due to budget or if there are other issues.
>> 
>> chris
>> 
>> On Jan 26, 2010, at 7:40 PM, Smithies, Russell wrote:
>> 
>>> Grrrrrr, I hate eutils!!!!
>>> 
>>> ------------- EXCEPTION: Bio::Root::Exception -------------
>>> MSG: NCBI esearch fatal error: Search Backend failed: Error 111 (Connection refused)
>>> STACK: Error::throw
>>> STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:357
>>> STACK: Bio::Tools::EUtilities::parse_data /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:332
>>> STACK: Bio::Tools::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:441
>>> STACK: Bio::DB::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/DB/EUtilities.pm:363
>>> STACK: get_desc.pl:32
>>> -----------------------------------------------------------
>>> 
>>> 
>>> Nice error message though :-)
>>> 
>>> 
>>> --Russell
>>> 
>>>> -----Original Message-----
>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>> bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
>>>> Sent: Monday, 11 January 2010 10:05 a.m.
>>>> To: 'Chris Fields'
>>>> Cc: 'Bhakti Dwivedi'; 'Mark A. Jensen'; 'bioperl-l at lists.open-bio.org'
>>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
>>>> number?
>>>> 
>>>> I've started to go off eUtils recently (not BioPerl's fault), as I've often
>>>> found that with large queries, chunks of the resulting data are missing.
>>>> For example, before Xmas I was creating species-specific databases by
>>>> using eUtils to get a list of GI numbers back for a taxid, then retrieving
>>>> the FASTA sequences in chunks of 500.
>>>> Very regularly, in the middle of the FASTA output there would be a
>>>> "Resource Unavailable" message, e.g.:
>>>> >test_sequence_1
>>>> TACGATCATCGCTResource UnavailableTACGACTCTGCT
>>>> >test_sequence_2
>>>> TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT
>>>> 
>>>> Often this wasn't detected until formatdb complained about invalid
>>>> characters.
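
A check like the one formatdb ends up doing can be run right after each download, so a corrupted chunk is caught (and can be re-fetched) before it reaches the database. The sketch below assumes nucleotide FASTA and flags any sequence line containing characters outside the IUPAC nucleotide codes; `find_bad_fasta_lines` is a hypothetical helper name, and protein data would need a different alphabet.

```perl
use strict;
use warnings;

# Hypothetical helper: scan FASTA text from a filehandle and return a list of
# [record id, line number, offending line] for every sequence line that
# contains anything other than IUPAC nucleotide codes or '-' gaps.
sub find_bad_fasta_lines {
    my ($fh) = @_;
    my (@bad, $id);
    my $line_no = 0;
    while (my $line = <$fh>) {
        $line_no++;
        $line =~ s/\r?\n\z//;
        next if $line eq '';
        if ($line =~ /^>(\S*)/) {    # header line: remember the record id
            $id = $1;
            next;
        }
        push @bad, [ defined $id ? $id : '(no header)', $line_no, $line ]
            if $line =~ /[^ACGTUNRYKMSWBDHV\-]/i;
    }
    return @bad;
}

# Demonstration on the example quoted in the message above.
my $fasta = <<'END';
>test_sequence_1
TACGATCATCGCTResource UnavailableTACGACTCTGCT
>test_sequence_2
TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT
END

open my $fh, '<', \$fasta or die "cannot open in-memory FASTA: $!";
my @bad = find_bad_fasta_lines($fh);
close $fh;
printf "%s (line %d): %s\n", @$_ for @bad;
```

Only the first record is reported here, because "Resource Unavailable" contains a space and letters such as E, O, I and L that are not nucleotide codes.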
>>>> Inquiries to NCBI as to why this was happening and what to do about it
>>>> returned stupid answers ("do each sequence manually thru the web
>>>> interface", or "use eUtils").
>>>> As we have a nice fast network connection, I now prefer to download very
>>>> large gzip files (i.e. all of refseq) and extract what I need.
>>>> 
>>>> I can't help but think that NCBI could solve a lot of problems if they
>>>> gzipped the output from eUtils queries - it's something I've requested
>>>> regularly for the last 5 years or so!!
>>>> 
>>>> --Russell
>>>> 
>>>> 
>>>>> -----Original Message-----
>>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>>>> Sent: Monday, 11 January 2010 9:50 a.m.
>>>>> To: Smithies, Russell
>>>>> Cc: 'Mark A. Jensen'; 'Bhakti Dwivedi'; 'bioperl-l at lists.open-bio.org'
>>>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
>>>>> number?
>>>>> 
>>>>> One could also use Bio::DB::Taxonomy, which indexes the same files or
>>>>> (alternatively) makes the eutil calls (see Bio::DB::Taxonomy POD for the
>>>>> details).
>>>>> 
>>>>> chris
>>>>> 
>>>>> On Jan 10, 2010, at 2:34 PM, Smithies, Russell wrote:
>>>>> 
>>>>>> An alternate non-BioPerly way (that may be faster given NCBI's flakiness
>>>>>> lately) would be to download the gi_taxid_nucl.zip or gi_taxid_prot.zip
>>>>>> files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/, load them into a hash and
>>>>>> do lookups.
>>>>>> In that same dir, taxdump.tar.gz contains a file called names.dmp, which
>>>>>> lists taxids and descriptions (and synonyms).
>>>>>> 
>>>>>> If it was me, I'd split gi_taxid_nucl and names.dmp into hashes so I
>>>>>> could do this (note that the gi_taxid files are keyed by GI number, so an
>>>>>> accession has to be mapped to its GI first):
>>>>>> 
>>>>>> my $taxid    = $gi_taxid_nucl{$gi};
>>>>>> my $org_name = $names{$taxid};
>>>>>> 
>>>>>> --Russell
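
The two hashes suggested above could be built roughly as follows. This is a sketch under stated assumptions: names.dmp rows are taken to be `taxid | name | unique name | name class |` with fields separated by tab-pipe-tab, and the gi_taxid dump is taken to be two tab-separated columns (GI, taxid). `load_scientific_names` and `load_gi_taxids` are hypothetical helper names, and the sample rows use made-up GI numbers purely for illustration. Because the real GI file is very large, the sketch keeps only the GIs of interest rather than loading everything.

```perl
use strict;
use warnings;

# Read names.dmp-style rows and return a hashref of taxid => scientific name.
sub load_scientific_names {
    my ($fh) = @_;
    my %names;
    while (my $line = <$fh>) {
        $line =~ s/\t\|\r?\n?\z//;    # strip the row terminator
        my ($taxid, $name, undef, $class) = split /\t\|\t/, $line, -1;
        next unless defined $class && $class eq 'scientific name';
        $names{$taxid} = $name;
    }
    return \%names;
}

# Read gi_taxid_*.dmp-style rows (GI <tab> taxid), keeping only wanted GIs.
sub load_gi_taxids {
    my ($fh, $wanted) = @_;
    my %taxid_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        next unless defined $taxid && $wanted->{$gi};
        $taxid_for{$gi} = $taxid;
    }
    return \%taxid_for;
}

# Tiny in-memory stand-ins for the real files; the GI numbers are made up.
my $names_txt = "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
              . "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n";
my $gi_txt    = "111\t9606\n222\t10090\n";

open my $names_fh, '<', \$names_txt or die "cannot open names data: $!";
open my $gi_fh,    '<', \$gi_txt    or die "cannot open GI data: $!";
my $names     = load_scientific_names($names_fh);
my $taxid_for = load_gi_taxids($gi_fh, { 111 => 1 });

my $org_name = $names->{ $taxid_for->{111} };
print "$org_name\n";
```

With real data, the filehandles would be opened on the unpacked dump files instead of the in-memory strings.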
>>>>>> 
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>>>>> bounces at lists.open-bio.org] On Behalf Of Mark A. Jensen
>>>>>>> Sent: Saturday, 26 December 2009 4:52 p.m.
>>>>>>> To: Bhakti Dwivedi; bioperl-l at lists.open-bio.org
>>>>>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
>>>>>>> number?
>>>>>>> 
>>>>>>> Bhakti,
>>>>>>> The following example (using EUtilities) may serve your purpose:
>>>>>>> 
>>>>>>> use Bio::DB::EUtilities;
>>>>>>> 
>>>>>>> my (%taxa, @taxa);
>>>>>>> my (%names, %idmap);
>>>>>>> 
>>>>>>> # these are protein ids; nuc ids will (probably) work by changing
>>>>>>> # -dbfrom => 'nucleotide'
>>>>>>> 
>>>>>>> my @ids = qw(1621261 89318838 68536103 20807972 730439);
>>>>>>> 
>>>>>>> my $factory = Bio::DB::EUtilities->new(-eutil => 'elink',
>>>>>>>                                     -db => 'taxonomy',
>>>>>>>                                     -dbfrom => 'protein',
>>>>>>>                                     -correspondence => 1,
>>>>>>>                                     -id => \@ids);
>>>>>>> 
>>>>>>> # iterate through the LinkSet objects
>>>>>>> while (my $ds = $factory->next_LinkSet) {
>>>>>>>  $taxa{($ds->get_submitted_ids)[0]} = ($ds->get_ids)[0]
>>>>>>> }
>>>>>>> 
>>>>>>> # skip ids that returned no taxonomy link (e.g. removed records)
>>>>>>> @taxa = grep { defined } @taxa{@ids};
>>>>>>> 
>>>>>>> $factory = Bio::DB::EUtilities->new(-eutil => 'esummary',
>>>>>>>      -db    => 'taxonomy',
>>>>>>>      -id    => \@taxa );
>>>>>>> 
>>>>>>> while (local $_ = $factory->next_DocSum) {
>>>>>>>  $names{($_->get_contents_by_name('TaxId'))[0]} =
>>>>>>> ($_->get_contents_by_name('ScientificName'))[0];
>>>>>>> }
>>>>>>> 
>>>>>>> foreach (@ids) {
>>>>>>>  $idmap{$_} = $names{$taxa{$_}};
>>>>>>> }
>>>>>>> 
>>>>>>> # %idmap is
>>>>>>> #    1621261 => 'Mycobacterium tuberculosis H37Rv'
>>>>>>> #    20807972 => 'Thermoanaerobacter tengcongensis MB4'
>>>>>>> #    68536103 => 'Corynebacterium jeikeium K411'
>>>>>>> #    730439 => 'Bacillus caldolyticus'
>>>>>>> #    89318838 => undef    (this record has been removed from the db)
>>>>>>> 
>>>>>>> 1;
>>>>>>> 
>>>>>>> You probably will need to break up your 30000 into chunks
>>>>>>> (say, 1000-3000 each), and do the above on each chunk with a
>>>>>>> 
>>>>>>> sleep 3;
>>>>>>> 
>>>>>>> or so separating the queries.
>>>>>>> MAJ
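
The chunk-and-pause advice above can be sketched as a small helper. `for_each_chunk` is a hypothetical name, not a BioPerl function; the handler passed to it would contain the elink/esummary code from the message, run against the ids in each chunk.

```perl
use strict;
use warnings;

# Hypothetical helper: pass @$ids to $handler in batches of at most $size,
# sleeping $pause seconds between batches (but not after the last one).
sub for_each_chunk {
    my ($ids, $size, $pause, $handler) = @_;
    die "chunk size must be positive\n" unless $size > 0;
    my @queue = @$ids;    # work on a copy so the caller's list is untouched
    while (my @chunk = splice @queue, 0, $size) {
        $handler->(\@chunk);
        sleep $pause if @queue && $pause > 0;
    }
    return;
}

# Demonstration with a pause of 0 so it runs instantly; real use would
# pass something like 3 and do the EUtilities queries in the handler.
my @sizes;
for_each_chunk([ 1 .. 2500 ], 1000, 0, sub { push @sizes, scalar @{ $_[0] } });
print "@sizes\n";
```

Collecting results into a hash declared outside the handler keeps the per-chunk code identical to the single-batch version.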
>>>>>>> ----- Original Message -----
>>>>>>> From: "Bhakti Dwivedi" <bhakti.dwivedi at gmail.com>
>>>>>>> To: <bioperl-l at lists.open-bio.org>
>>>>>>> Sent: Friday, December 25, 2009 9:46 PM
>>>>>>> Subject: [Bioperl-l] how to retrieve organism name from accession number?
>>>>>>> 
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Does anyone know how to retrieve the "Source" or the "Species name" given
>>>>>>>> the accession number using BioPerl? I have 30,000 accession numbers
>>>>>>>> for which I need to get the source organisms. Any kind of help will be
>>>>>>>> appreciated.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> BD
>>>>>>>> _______________________________________________
>>>>>>>> Bioperl-l mailing list
>>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>>> 
>>>>>>>> 
>>>>>>> 