[Bioperl-l] how to retrieve organism name from accession number?

Chris Fields cjfields at illinois.edu
Thu Jan 28 19:30:12 UTC 2010


Russell,

Okay, just wanted to make sure.  The email/tool requirements weren't
actually enforced up until now, which is forcing us to do a bit of
re-work on the various tools that don't set them by default (or at
least to warn users who are unaware of the requirement).

And I agree, gzipped archives would be nice!

chris

On Fri, 2010-01-29 at 08:25 +1300, Smithies, Russell wrote:
> Yes, I usually set the 'tool' and 'email' parameters.
> I went to NCBI back in 2006 and did their "PowerScripting" course where they pointed out a lot of the requirements for using eUtils. I think I requested results returned gzipped back then as well...
> 
> --Russell
> 
> > -----Original Message-----
> > From: Chris Fields [mailto:cjfields at illinois.edu]
> > Sent: Friday, 29 January 2010 7:26 a.m.
> > To: Smithies, Russell
> > Cc: 'bioperl-l at lists.open-bio.org'; 'Mark A. Jensen'
> > Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> > number?
> > 
> > Russell,
> > 
> > Just curious, but have you tried setting the return email parameter
> > (-email)?  NCBI recently stated that all queries would eventually
> > require a return email of some sort (not sure if it's validated or not).
> > I think that was set for around late spring.  I'm changing the code in
> > svn to require it for that very purpose.
> > 
> > chris
> > 
> > 
> > On Wed, 2010-01-27 at 15:45 +1300, Smithies, Russell wrote:
> > > Batch-entrez http://www.ncbi.nlm.nih.gov/portal/utils/batchentrez_p.cgi
> > > still works if you don't mind a bit of manual button clicking. It's
> > > handling chunks of 100,000 records OK (today).
> > >
> > > --Russell
> > >
> > > > -----Original Message-----
> > > > From: Chris Fields [mailto:cjfields at illinois.edu]
> > > > Sent: Wednesday, 27 January 2010 3:42 p.m.
> > > > To: Smithies, Russell
> > > > Cc: 'bioperl-l at lists.open-bio.org'; 'Mark A. Jensen'
> > > > Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> > > > number?
> > > >
> > > > Makes me wonder if they're pushing more users towards the SOAP-based
> > > > services and away from eutils.
> > > >
> > > > chris
> > > >
> > > > On Jan 26, 2010, at 7:59 PM, Smithies, Russell wrote:
> > > >
> > > > > I've had a wide selection of errors lately:
> > > > >
> > > > > ------------- EXCEPTION: Bio::Root::Exception -------------
> > > > > MSG: NCBI esearch fatal error: Search Backend failed: Error 11 (Resource temporarily unavailable)
> > > > > STACK: Error::throw
> > > > > STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:357
> > > > > STACK: Bio::Tools::EUtilities::parse_data /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:332
> > > > > STACK: Bio::Tools::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:441
> > > > > STACK: Bio::DB::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/DB/EUtilities.pm:363
> > > > > STACK: get_desc.pl:32
> > > > > -----------------------------------------------------------
> > > > >
> > > > > And I never get a good explanation from NCBI or suggestions on how
> > > > > to avoid it.
> > > > >
> > > > >
> > > > > --Russell
> > > > >
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Chris Fields [mailto:cjfields at illinois.edu]
> > > > >> Sent: Wednesday, 27 January 2010 2:46 p.m.
> > > > >> To: Smithies, Russell
> > > > >> Cc: 'Mark A. Jensen'; 'bioperl-l at lists.open-bio.org'
> > > > >> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> > > > >> number?
> > > > >>
> > > > >> It's unfortunate but I have heard this problem popping up quite a bit
> > > > >> more frequently lately.  Not to push too many buttons but NCBI isn't
> > > > >> very forthcoming with help these days; they have become quite insular.
> > > > >> Not sure if they're short-staffed due to budget or if there are other
> > > > >> issues.
> > > > >>
> > > > >> chris
> > > > >>
> > > > >> On Jan 26, 2010, at 7:40 PM, Smithies, Russell wrote:
> > > > >>
> > > > >>> Grrrrrr, I hate eutils!!!!
> > > > >>>
> > > > >>> ------------- EXCEPTION: Bio::Root::Exception -------------
> > > > >>> MSG: NCBI esearch fatal error: Search Backend failed: Error 111 (Connection refused)
> > > > >>> STACK: Error::throw
> > > > >>> STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:357
> > > > >>> STACK: Bio::Tools::EUtilities::parse_data /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:332
> > > > >>> STACK: Bio::Tools::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/Tools/EUtilities.pm:441
> > > > >>> STACK: Bio::DB::EUtilities::get_ids /usr/lib/perl5/site_perl/5.8.8/Bio/DB/EUtilities.pm:363
> > > > >>> STACK: get_desc.pl:32
> > > > >>> -----------------------------------------------------------
> > > > >>>
> > > > >>>
> > > > >>> Nice error message though :-)
> > > > >>>
> > > > >>>
> > > > >>> --Russell
> > > > >>>
> > > > >>>> -----Original Message-----
> > > > >>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > > > >>>> bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> > > > >>>> Sent: Monday, 11 January 2010 10:05 a.m.
> > > > >>>> To: 'Chris Fields'
> > > > >>>> Cc: 'Bhakti Dwivedi'; 'Mark A. Jensen'; 'bioperl-l at lists.open-bio.org'
> > > > >>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> > > > >>>> number?
> > > > >>>>
> > > > >>>> I've started to go off eUtils recently (not BioPerl's fault) as I've
> > > > >>>> often been finding that with large queries, chunks of the resulting
> > > > >>>> data are missing.
> > > > >>>> For example, before Xmas I was creating species-specific databases by
> > > > >>>> using eUtils to get a list of GI numbers back for a taxid, then
> > > > >>>> retrieving the fasta sequences in chunks of 500.
> > > > >>>> Very regularly, in the middle of the fasta there would be a message
> > > > >>>> about resource unavailable, e.g.
> > > > >>>> >test_sequence_1
> > > > >>>> TACGATCATCGCTResource UnavailableTACGACTCTGCT
> > > > >>>> >test_sequence_2
> > > > >>>> TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT
> > > > >>>>
> > > > >>>> Often this wasn't detected until formatdb complained about invalid
> > > > >>>> characters.
> > > > >>>> Inquiries to NCBI as to why this was happening and what to do about
> > > > >>>> it returned stupid answers ("do each sequence manually thru the web
> > > > >>>> interface", or "use eUtils").
> > > > >>>> As we have a nice fast network connection, I now prefer to download
> > > > >>>> very large gzip files (i.e. all of refseq) and extract what I need.
> > > > >>>>
> > > > >>>> I can't help but think that NCBI could solve a lot of problems if
> > > > >>>> they gzipped the output from eUtils queries - it's something I've
> > > > >>>> requested regularly for the last 5 years or so!!
> > > > >>>>
> > > > >>>> --Russell
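
A minimal sketch of catching the kind of corruption described above before formatdb does: scan each FASTA record and flag any whose sequence lines contain characters outside an expected alphabet. The alphabet used here (IUPAC nucleotide codes plus '-' and '*') and the sub name find_bad_records are assumptions for illustration; nothing here is part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): return the IDs of FASTA
# records whose sequence lines contain characters outside the IUPAC
# nucleotide alphabet (plus '-' and '*').
sub find_bad_records {
    my ($fh) = @_;
    my $id = 'UNKNOWN';                      # used if data precedes any header
    my (@bad, %seen);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;            # skip blank lines
        if ($line =~ /^>(\S+)/) {            # a header line starts a new record
            $id = $1;
            next;
        }
        # Any character outside the allowed alphabet marks the record as bad;
        # %seen makes sure each record is reported only once.
        if ($line =~ /[^ACGTUNRYKMSWBDHVacgtunrykmswbdhv*-]/) {
            push @bad, $id unless $seen{$id}++;
        }
    }
    return @bad;
}

# Usage: the two records quoted above, read from an in-memory filehandle.
my $fasta = join "\n",
    '>test_sequence_1',
    'TACGATCATCGCTResource UnavailableTACGACTCTGCT',
    '>test_sequence_2',
    'TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT',
    '';
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my @bad = find_bad_records($fh);
print "Corrupt record: $_\n" for @bad;       # prints "Corrupt record: test_sequence_1"
```

Running a check like this on each downloaded chunk would let a bad chunk be re-fetched straight away instead of surfacing later as a formatdb error. For protein sequences the character class would need widening.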
> > > > >>>>
> > > > >>>>
> > > > >>>>> -----Original Message-----
> > > > >>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
> > > > >>>>> Sent: Monday, 11 January 2010 9:50 a.m.
> > > > >>>>> To: Smithies, Russell
> > > > >>>>> Cc: 'Mark A. Jensen'; 'Bhakti Dwivedi'; 'bioperl-l at lists.open-bio.org'
> > > > >>>>> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> > > > >>>>> number?
> > > > >>>>>
> > > > >>>>> One could also use Bio::DB::Taxonomy, which indexes the same files
> > > > >>>>> or (alternatively) makes the eutil calls (see Bio::DB::Taxonomy POD
> > > > >>>>> for the details).
> > > > >>>>>
> > > > >>>>> chris
> > > > >>>>>
> > > > >>>>> On Jan 10, 2010, at 2:34 PM, Smithies, Russell wrote:
> > > > >>>>>
> > > > >>>>>> An alternate non-BioPerly way (that may be faster given NCBI's
> > > > >>>>>> flakiness lately) would be to download the gi_taxid_nucl.zip or
> > > > >>>>>> gi_taxid_prot.zip files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/,
> > > > >>>>>> load them into a hash and do lookups.
> > > > >>>>>> In that same dir, taxdump.tar.gz contains a file called names.dmp
> > > > >>>>>> which lists taxids and descriptions (and synonyms)
> > > > >>>>>>
> > > > >>>>>> If it was me, I'd split gi_taxid_nucl and names.dmp into hashes so
> > > > >>>>>> I could do this:
> > > > >>>>>>
> > > > >>>>>> my $taxid  = $gi_taxid_nucl{$accession};
> > > > >>>>>> my $org_name = $names{$taxid};
> > > > >>>>>>
> > > > >>>>>> --Russell
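
The lookup above can be sketched as follows, assuming the unpacked dump files have the layouts I believe they do: gi_taxid_nucl.dmp as two tab-separated columns (GI, taxid), and names.dmp as fields separated by tab-pipe-tab with a name class of 'scientific name' on the row you want. Note that the first hash is keyed by GI number rather than by accession. The sub names and the tiny inline sample data (GI 12345 is a placeholder) are made up so the sketch runs without downloading anything.

```perl
use strict;
use warnings;

# Parse gi_taxid_*.dmp content: one "GI<TAB>taxid" pair per line (assumed layout).
sub load_gi_taxid {
    my ($fh) = @_;
    my %gi2taxid;
    while (my $line = <$fh>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        next unless defined $taxid;          # skip malformed or empty lines
        $gi2taxid{$gi} = $taxid;
    }
    return \%gi2taxid;
}

# Parse names.dmp content: "taxid | name | unique name | name class |" with
# "\t|\t" between fields (assumed layout). Only scientific names are kept.
sub load_names {
    my ($fh) = @_;
    my %names;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\t\|$//;                  # drop the trailing field terminator
        my ($taxid, $name, undef, $class) = split /\t\|\t/, $line;
        next unless defined $class && $class eq 'scientific name';
        $names{$taxid} = $name;
    }
    return \%names;
}

# Usage with placeholder data held in memory instead of the real files.
my $gi_data    = "12345\t9606\n";
my $names_data = "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
               . "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n";
open my $gi_fh,    '<', \$gi_data    or die "Cannot open GI data: $!";
open my $names_fh, '<', \$names_data or die "Cannot open names data: $!";

my $gi2taxid = load_gi_taxid($gi_fh);
my $names    = load_names($names_fh);

my $taxid    = $gi2taxid->{12345};
my $org_name = defined $taxid ? $names->{$taxid} : undef;
print defined $org_name ? "$org_name\n" : "not found\n";   # prints "Homo sapiens"
```

The real gi_taxid files are large, so keeping only the GIs you actually need while reading may be kinder to memory than hashing the whole file.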
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>> -----Original Message-----
> > > > >>>>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > > > >>>>>>> bounces at lists.open-bio.org] On Behalf Of Mark A. Jensen
> > > > >>>>>>> Sent: Saturday, 26 December 2009 4:52 p.m.
> > > > >>>>>>> To: Bhakti Dwivedi; bioperl-l at lists.open-bio.org
> > > > >>>>>>> Subject: Re: [Bioperl-l] how to retrieve organism name from
> > > > >>>>>>> accession number?
> > > > >>>>>>>
> > > > >>>>>>> Bhakti,
> > > > >>>>>>> The following example (using EUtilities) may serve your purpose:
> > > > >>>>>>>
> > > > >>>>>>> use Bio::DB::EUtilities;
> > > > >>>>>>>
> > > > >>>>>>> my (%taxa, @taxa);
> > > > >>>>>>> my (%names, %idmap);
> > > > >>>>>>>
> > > > >>>>>>> # these are protein ids; nuc ids will work by changing
> > > > >>>>>>> # -dbfrom => 'nucleotide' (probably)
> > > > >>>>>>>
> > > > >>>>>>> my @ids = qw(1621261 89318838 68536103 20807972 730439);
> > > > >>>>>>>
> > > > >>>>>>> my $factory = Bio::DB::EUtilities->new(-eutil => 'elink',
> > > > >>>>>>>                                     -db => 'taxonomy',
> > > > >>>>>>>                                     -dbfrom => 'protein',
> > > > >>>>>>>                                     -correspondence => 1,
> > > > >>>>>>>                                     -id => \@ids);
> > > > >>>>>>>
> > > > >>>>>>> # iterate through the LinkSet objects
> > > > >>>>>>> while (my $ds = $factory->next_LinkSet) {
> > > > >>>>>>>  $taxa{($ds->get_submitted_ids)[0]} = ($ds->get_ids)[0]
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>> @taxa = @taxa{@ids};
> > > > >>>>>>>
> > > > >>>>>>> $factory = Bio::DB::EUtilities->new(-eutil => 'esummary',
> > > > >>>>>>>      -db    => 'taxonomy',
> > > > >>>>>>>      -id    => \@taxa );
> > > > >>>>>>>
> > > > >>>>>>> while (local $_ = $factory->next_DocSum) {
> > > > >>>>>>>  $names{($_->get_contents_by_name('TaxId'))[0]} =
> > > > >>>>>>> ($_->get_contents_by_name('ScientificName'))[0];
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>> foreach (@ids) {
> > > > >>>>>>>  $idmap{$_} = $names{$taxa{$_}};
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>> # %idmap is
> > > > >>>>>>> #    1621261 => 'Mycobacterium tuberculosis H37Rv'
> > > > >>>>>>> #    20807972 => 'Thermoanaerobacter tengcongensis MB4'
> > > > >>>>>>> #    68536103 => 'Corynebacterium jeikeium K411'
> > > > >>>>>>> #    730439 => 'Bacillus caldolyticus'
> > > > >>>>>>> #    89318838 => undef    (this record has been removed from the db)
> > > > >>>>>>>
> > > > >>>>>>> 1;
> > > > >>>>>>>
> > > > >>>>>>> You probably will need to break up your 30000 into chunks
> > > > >>>>>>> (say, 1000-3000 each), and do the above on each chunk with a
> > > > >>>>>>>
> > > > >>>>>>> sleep 3;
> > > > >>>>>>>
> > > > >>>>>>> or so separating the queries.
> > > > >>>>>>> MAJ
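
The chunk-and-sleep advice above can be sketched like this. The sub name process_in_chunks and the $fetch callback are hypothetical: in real use the callback would wrap the elink/esummary calls from the example, while here a stub stands in so the sketch runs without touching NCBI.

```perl
use strict;
use warnings;

# Split a list of IDs into batches and pause between batches.
# $fetch is a caller-supplied code ref (hypothetical; e.g. a wrapper around
# the elink/esummary calls above) that receives an array ref of IDs.
sub process_in_chunks {
    my ($ids, $chunk_size, $pause, $fetch) = @_;
    my @queue = @$ids;                       # copy so the caller's list is untouched
    my @results;
    while (my @chunk = splice(@queue, 0, $chunk_size)) {
        push @results, $fetch->(\@chunk);
        sleep $pause if $pause && @queue;    # no need to wait after the last batch
    }
    return @results;
}

# Usage: a stub callback that just reports each batch size, with no pause.
my @ids   = (1 .. 2500);
my @sizes = process_in_chunks(\@ids, 1000, 0, sub { scalar @{ $_[0] } });
print "@sizes\n";                            # prints "1000 1000 500"
```

With real queries you would pass a pause of 3 or so, as suggested above, and a callback that merges each batch's results into %idmap.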
> > > > >>>>>>> ----- Original Message -----
> > > > >>>>>>> From: "Bhakti Dwivedi" <bhakti.dwivedi at gmail.com>
> > > > >>>>>>> To: <bioperl-l at lists.open-bio.org>
> > > > >>>>>>> Sent: Friday, December 25, 2009 9:46 PM
> > > > >>>>>>> Subject: [Bioperl-l] how to retrieve organism name from accession
> > > > >>>>>>> number?
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>> Hi,
> > > > >>>>>>>>
> > > > >>>>>>>> Does anyone know how to retrieve the "Source" or the "Species
> > > > >>>>>>>> name" given the accession number using Bioperl?  I have 30,000
> > > > >>>>>>>> accession numbers for which I need to get the source organisms.
> > > > >>>>>>>> Any kind of help will be appreciated.
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks
> > > > >>>>>>>>
> > > > >>>>>>>> BD
> > > > >>>>>>>> _______________________________________________
> > > > >>>>>>>> Bioperl-l mailing list
> > > > >>>>>>>> Bioperl-l at lists.open-bio.org
> > > > >>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > 
> 




