[Bioperl-l] how to retrieve organism name from accession number?
Chris Fields
cjfields at illinois.edu
Tue Jan 26 20:46:26 EST 2010
It's unfortunate, but I have heard of this problem popping up quite a bit more frequently lately. Not to push too many buttons, but NCBI isn't very forthcoming with help these days; they have become quite insular. Not sure if they're short-staffed due to budget or if there are other issues.
chris
On Jan 26, 2010, at 7:40 PM, Smithies, Russell wrote:
> Grrrrrr, I hate eutils!!!!
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: NCBI esearch fatal error: Search Backend failed: Error 111 (Connection refused)
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:357
> STACK: Bio::Tools::EUtilities::parse_data /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:332
> STACK: Bio::Tools::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:441
> STACK: Bio::DB::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/DB/EUtilities.pm:363
> STACK: get_desc.pl:32
> -----------------------------------------------------------
>
>
> Nice error message though :-)
>
>
> --Russell
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
>> Sent: Monday, 11 January 2010 10:05 a.m.
>> To: 'Chris Fields'
>> Cc: 'Bhakti Dwivedi'; 'Mark A. Jensen'; 'bioperl-l at lists.open-bio.org'
>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession number?
>>
>> I've started to go off eUtils recently (not BioPerl's fault) as I've often
>> been finding that, with large queries, chunks of the resulting data are
>> missing.
>> For example, before Xmas I was creating species-specific databases by
>> using eUtils to get a list of GI numbers back for a taxid, then retrieving
>> the fasta sequences in chunks of 500.
>> Very regularly, in the middle of the fasta there would be a message about
>> resource unavailable, e.g.:
>> >test_sequence_1
>> TACGATCATCGCTResource UnavailableTACGACTCTGCT
>> >test_sequence_2
>> TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT
>>
>> Often this wasn't detected until formatdb complained about invalid
>> characters.
>> Inquiries to NCBI as to why this was happening and what to do about it
>> returned stupid answers ("do each sequence manually thru the web
>> interface", or "use eUtils").
>> As we have a nice fast network connection, I now prefer to download very
>> large gzip files (i.e. all of refseq) and extract what I need.
>>
>> I can't help but think that NCBI could solve a lot of problems if they
>> gzipped the output from eUtils queries - it's something I've requested
>> regularly for the last 5 years or so!!
>>
>> --Russell
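
A minimal sketch of how the "Resource Unavailable" corruption described above could be caught before formatdb sees the file. It assumes nucleotide FASTA and treats anything outside the IUPAC nucleotide codes, '-' and '*' as suspect. The sub name find_corrupt_records is made up for illustration and is not part of BioPerl.

```perl
use strict;
use warnings;

# Return the IDs of FASTA records whose sequence lines contain anything
# other than IUPAC nucleotide codes, '-' or '*'.
# Assumption: nucleotide data; protein FASTA would need a wider alphabet.
sub find_corrupt_records {
    my ($fasta) = @_;
    my (@bad, $id, $flagged);
    for my $line (split /\r?\n/, $fasta) {
        if ($line =~ /^>(\S*)/) {        # header line starts a new record
            $id      = $1;
            $flagged = 0;
            next;
        }
        next unless defined $id;         # ignore text before the first header
        next if $flagged;                # report each record only once
        next if $line =~ /^\s*$/;        # skip blank lines
        if ($line !~ /^[ACGTUNRYSWKMBDHV*-]+\s*$/i) {
            push @bad, $id;
            $flagged = 1;
        }
    }
    return @bad;
}

# The example from the message above.
my $fasta = <<'END';
>test_sequence_1
TACGATCATCGCTResource UnavailableTACGACTCTGCT
>test_sequence_2
TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT
END

print "corrupt: $_\n" for find_corrupt_records($fasta);
```

Running a check like this on each downloaded chunk would let a script re-fetch just the damaged chunk instead of discovering the problem at the formatdb stage.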
>>
>>
>>> -----Original Message-----
>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>> Sent: Monday, 11 January 2010 9:50 a.m.
>>> To: Smithies, Russell
>>> Cc: 'Mark A. Jensen'; 'Bhakti Dwivedi'; 'bioperl-l at lists.open-bio.org'
>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession number?
>>>
>>> One could also use Bio::DB::Taxonomy, which indexes the same files or
>>> (alternatively) makes the eutil calls (see Bio::DB::Taxonomy POD for the
>>> details).
>>>
>>> chris
>>>
>>> On Jan 10, 2010, at 2:34 PM, Smithies, Russell wrote:
>>>
>>>> An alternate non-BioPerly way (that may be faster given NCBI's flakiness
>>>> lately) would be to download the gi_taxid_nucl.zip or gi_taxid_prot.zip
>>>> files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/, load them into a hash, and
>>>> do lookups.
>>>> In that same dir, taxdump.tar.gz contains a file called names.dmp, which
>>>> lists taxids and descriptions (and synonyms).
>>>>
>>>> If it was me, I'd split gi_taxid_nucl and names.dmp into hashes so I
>>>> could do this:
>>>>
>>>> my $taxid = $gi_taxid_nucl{$gi};   # these files are keyed by GI number, not accession
>>>> my $org_name = $names{$taxid};
>>>>
>>>> --Russell
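
A rough sketch of the hash-lookup idea above, assuming the dump files have already been unpacked. It relies on the published layout of the NCBI dumps: the gi_taxid files are two tab-separated columns (GI, taxid), and names.dmp separates fields with "\t|\t" and ends each line with "\t|". The sub names and the sample GI are invented for illustration; the demo reads from in-memory strings rather than the real files.

```perl
use strict;
use warnings;

# Build a GI => taxid hash from a gi_taxid_*.dmp filehandle.
# If $wanted (a hash ref of GIs) is given, only those GIs are kept,
# which keeps memory use down on the full-size dumps.
sub load_gi_taxid {
    my ($fh, $wanted) = @_;
    my %taxid_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        next unless defined $taxid;
        next if $wanted && !exists $wanted->{$gi};
        $taxid_for{$gi} = $taxid;
    }
    return \%taxid_for;
}

# Build a taxid => scientific name hash from a names.dmp filehandle.
# Columns: taxid, name, unique name, name class.
sub load_scientific_names {
    my ($fh) = @_;
    my %name_for;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\t\|$//;                      # drop the line terminator
        my ($taxid, $name, undef, $class) = split /\t\|\t/, $line;
        next unless defined $class && $class eq 'scientific name';
        $name_for{$taxid} = $name;
    }
    return \%name_for;
}

# Demo on in-memory data; GI 111 is a made-up value.
my $gi_data    = "111\t9606\n";
my $names_data = "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
               . "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n";

open my $gi_fh,    '<', \$gi_data    or die "cannot open gi data: $!";
open my $names_fh, '<', \$names_data or die "cannot open names data: $!";

my $taxid_for = load_gi_taxid($gi_fh);
my $name_for  = load_scientific_names($names_fh);
print "111 => $name_for->{ $taxid_for->{111} }\n";
```

With the real files, the filehandles would come from opening the unpacked .dmp files instead. Passing the list of GIs of interest as $wanted is probably wise, since loading the whole nucleotide mapping into a hash needs a lot of memory.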
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Mark A. Jensen
>>>>> Sent: Saturday, 26 December 2009 4:52 p.m.
>>>>> To: Bhakti Dwivedi; bioperl-l at lists.open-bio.org
>>>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession number?
>>>>>
>>>>> Bhakti,
>>>>> The following example (using EUtilities) may serve your purpose:
>>>>>
>>>>> use strict;
>>>>> use warnings;
>>>>> use Bio::DB::EUtilities;
>>>>>
>>>>> my (%taxa, @taxa);
>>>>> my (%names, %idmap);
>>>>>
>>>>> # these are protein ids; nuc ids will (probably) work by changing
>>>>> # -dbfrom => 'nucleotide'
>>>>>
>>>>> my @ids = qw(1621261 89318838 68536103 20807972 730439);
>>>>>
>>>>> # elink: map each submitted protein id to its taxonomy id
>>>>> my $factory = Bio::DB::EUtilities->new(-eutil          => 'elink',
>>>>>                                        -db             => 'taxonomy',
>>>>>                                        -dbfrom         => 'protein',
>>>>>                                        -correspondence => 1,
>>>>>                                        -id             => \@ids);
>>>>>
>>>>> # iterate through the LinkSet objects
>>>>> while (my $ds = $factory->next_LinkSet) {
>>>>>     $taxa{($ds->get_submitted_ids)[0]} = ($ds->get_ids)[0];
>>>>> }
>>>>>
>>>>> # hash slice of %taxa; skip ids that returned no taxid
>>>>> @taxa = grep { defined } @taxa{@ids};
>>>>>
>>>>> # esummary: fetch the scientific name for each taxonomy id
>>>>> $factory = Bio::DB::EUtilities->new(-eutil => 'esummary',
>>>>>                                     -db    => 'taxonomy',
>>>>>                                     -id    => \@taxa);
>>>>>
>>>>> while (my $docsum = $factory->next_DocSum) {
>>>>>     $names{($docsum->get_contents_by_name('TaxId'))[0]} =
>>>>>         ($docsum->get_contents_by_name('ScientificName'))[0];
>>>>> }
>>>>>
>>>>> # join the two maps: submitted id => scientific name
>>>>> foreach my $id (@ids) {
>>>>>     $idmap{$id} = defined $taxa{$id} ? $names{$taxa{$id}} : undef;
>>>>> }
>>>>>
>>>>> # %idmap is
>>>>> # 1621261 => 'Mycobacterium tuberculosis H37Rv'
>>>>> # 20807972 => 'Thermoanaerobacter tengcongensis MB4'
>>>>> # 68536103 => 'Corynebacterium jeikeium K411'
>>>>> # 730439 => 'Bacillus caldolyticus'
>>>>> # 89318838 => undef (this record has been removed from the db)
>>>>>
>>>>> 1;
>>>>>
>>>>> You probably will need to break up your 30000 into chunks
>>>>> (say, 1000-3000 each), and do the above on each chunk with a
>>>>>
>>>>> sleep 3;
>>>>>
>>>>> or so separating the queries.
>>>>> MAJ
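
One possible way to do the chunking suggested above, as a sketch only. chunk_list and for_each_chunk are hypothetical helper names, and the callback here just prints; in real use it would wrap the elink/esummary calls for one chunk. The pause is set to 0 in the demo so it runs instantly; the message above suggests about 3 seconds.

```perl
use strict;
use warnings;

# Split a list into array refs of at most $size elements each.
sub chunk_list {
    my ($size, @items) = @_;
    die "chunk size must be positive\n" unless $size > 0;
    my @chunks;
    push @chunks, [ splice @items, 0, $size ] while @items;
    return @chunks;
}

# Call $callback on each chunk, sleeping $pause seconds between calls
# (but not after the last one).
sub for_each_chunk {
    my ($callback, $pause, @chunks) = @_;
    for my $i (0 .. $#chunks) {
        $callback->($chunks[$i]);
        sleep $pause if $pause && $i < $#chunks;
    }
    return;
}

my @ids = (1 .. 7);    # placeholder ids, not real GI numbers
for_each_chunk(
    sub { my ($chunk) = @_; print "would query: @$chunk\n" },
    0,                 # use 3 or so against the real service
    chunk_list(3, @ids),
);
```

For the 30,000 accessions in the original question, chunk_list(1000, @ids) would give 30 chunks, so a 3-second pause adds about a minute and a half in total.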
>>>>> ----- Original Message -----
>>>>> From: "Bhakti Dwivedi" <bhakti.dwivedi at gmail.com>
>>>>> To: <bioperl-l at lists.open-bio.org>
>>>>> Sent: Friday, December 25, 2009 9:46 PM
>>>>> Subject: [Bioperl-l] how to retrieve organism name from accession number?
>>>>>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Does anyone know how to retrieve the "Source" or the "Species name" given
>>>>>> the accession number using Bioperl? I have these 30,000 accession numbers
>>>>>> for which I need to get the source organisms. Any kind of help will be
>>>>>> appreciated.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> BD
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l