[Bioperl-l] how to retrieve organism name from accession number?

Smithies, Russell Russell.Smithies at agresearch.co.nz
Sun Jan 10 21:05:06 UTC 2010


I've started to go off eUtils recently (not BioPerl's fault), as I've often found that with large queries, chunks of the resulting data are missing.
For example, before Xmas I was creating species-specific databases by using eUtils to get a list of GI numbers back for a taxid, then retrieving the fasta sequences in chunks of 500.
Very regularly, in the middle of the fasta there would be a message about a resource being unavailable, e.g.:
  >test_sequence_1
  TACGATCATCGCTResource UnavailableTACGACTCTGCT
  >test_sequence_2
  TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT

Often this wasn't detected until formatdb complained about invalid characters.
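
A minimal sketch of how this kind of corruption could be caught before formatdb sees it, assuming nucleotide sequences only (protein fasta would need a different alphabet). The sub name find_bad_fasta_lines is hypothetical, not part of BioPerl or eUtils; it simply flags any sequence line containing characters outside the IUPAC nucleotide codes:

```perl
use strict;
use warnings;

# Hypothetical helper: scan FASTA text and return one entry
# [header, line_number, line] for every sequence line that contains
# characters outside the IUPAC nucleotide alphabet (plus '-' and '*').
sub find_bad_fasta_lines {
    my ($fh) = @_;
    my @problems;
    my $header  = '(no header)';
    my $line_no = 0;
    while (defined(my $line = <$fh>)) {
        $line_no++;
        $line =~ s/\r?\n\z//;          # strip Unix or DOS line endings
        next if $line =~ /^\s*$/;      # ignore blank lines
        if ($line =~ /^>(.*)/) {       # header line: remember it and move on
            $header = $1;
            next;
        }
        if ($line =~ /[^ACGTUNRYKMSWBDHVacgtunrykmswbdhv*-]/) {
            push @problems, [$header, $line_no, $line];
        }
    }
    return @problems;
}

# Demonstration using the corrupted record shown above
my $fasta = <<'END';
>test_sequence_1
TACGATCATCGCTResource UnavailableTACGACTCTGCT
>test_sequence_2
TACGTACTACGATCGATCATCACTATCGTCATACTACTACTGACT
END

open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my @problems = find_bad_fasta_lines($fh);
close $fh;

printf "%s (line %d): %s\n", @$_ for @problems;
```

Running a check like this on each 500-sequence chunk as it arrives would let a bad chunk be re-requested straight away rather than discovered at the end.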
Inquiries to NCBI as to why this was happening and what to do about it returned stupid answers ("do each sequence manually thru the web interface", or "use eUtils").
As we have a nice fast network connection, I now prefer to download very large gzip files (e.g. all of RefSeq) and extract what I need.
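
Roughly what I mean by "extract what I need", as a sketch rather than my actual script: stream the gzipped fasta with IO::Uncompress::Gunzip (a core module) and keep only the records whose ID is in a lookup hash. The sub name extract_fasta_records and the sample IDs are made up for illustration, and the demo builds a tiny gzip file in a temp location instead of using a real RefSeq download:

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Gzip qw(gzip $GzipError);
use File::Temp qw(tempfile);

# Hypothetical helper: stream a gzipped FASTA file and return a hashref of
# the records whose ID (first whitespace-delimited word after '>') is a key
# of %$wanted. The file is never fully decompressed to disk or memory.
sub extract_fasta_records {
    my ($gz_path, $wanted) = @_;
    my $in = IO::Uncompress::Gunzip->new($gz_path, MultiStream => 1)
        or die "Cannot open $gz_path: $GunzipError";
    my (%records, $current);
    while (defined(my $line = $in->getline)) {
        if ($line =~ /^>(\S+)/) {
            # New record: keep it only if its ID was asked for
            $current = exists $wanted->{$1} ? $1 : undef;
        }
        $records{$current} .= $line if defined $current;
    }
    $in->close;
    return \%records;
}

# Demonstration on a small, made-up gzipped FASTA file
my $fasta = ">seq_a first\nACGT\nTTGA\n>seq_b second\nGGCC\n>seq_c third\nAAAA\n";
my ($tmp_fh, $tmp_path) = tempfile(SUFFIX => '.fa.gz', UNLINK => 1);
close $tmp_fh;
gzip(\$fasta => $tmp_path) or die "gzip failed: $GzipError";

my $records = extract_fasta_records($tmp_path, { seq_a => 1, seq_c => 1 });
print $records->{$_} for sort keys %$records;
```

For a really large ID list the hash lookup keeps each record test cheap; only the wanted records are held in memory.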

I can't help but think that NCBI could solve a lot of problems if they gzipped the output from eUtils queries - it's something I've requested regularly for the last 5 years or so!!

--Russell


> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu]
> Sent: Monday, 11 January 2010 9:50 a.m.
> To: Smithies, Russell
> Cc: 'Mark A. Jensen'; 'Bhakti Dwivedi'; 'bioperl-l at lists.open-bio.org'
> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> number?
> 
> One could also use Bio::DB::Taxonomy, which indexes the same files or
> (alternatively) makes the eutil calls (see Bio::DB::Taxonomy POD for the
> details).
> 
> chris
> 
> On Jan 10, 2010, at 2:34 PM, Smithies, Russell wrote:
> 
> > An alternate non-BioPerly way (that may be faster given NCBI's flakiness
> > lately) would be to download the gi_taxid_nucl.zip or gi_taxid_prot.zip
> > files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/, load them into a hash and
> > do lookups.
> > In that same dir, taxdump.tar.gz contains a file called names.dmp which
> > lists taxids and descriptions (and synonyms).
> >
> > If it was me, I'd split gi_taxid_nucl and names.dmp into hashes so I
> > could do this:
> >
> >   my $taxid    = $gi_taxid_nucl{$gi};   # the gi_taxid files are keyed by GI number, not accession
> >   my $org_name = $names{$taxid};
> >
> > --Russell
> >
> >
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> >> bounces at lists.open-bio.org] On Behalf Of Mark A. Jensen
> >> Sent: Saturday, 26 December 2009 4:52 p.m.
> >> To: Bhakti Dwivedi; bioperl-l at lists.open-bio.org
> >> Subject: Re: [Bioperl-l] how to retrieve organism name from accession
> >> number?
> >>
> >> Bhakti,
> >> The following example (using EUtilities) may serve your purpose:
> >>
> >> use Bio::DB::EUtilities;
> >>
> >> my (%taxa, @taxa);
> >> my (%names, %idmap);
> >>
> >> # these are protein ids; nuc ids will (probably) work by changing
> >> # -dbfrom => 'nucleotide'
> >>
> >> my @ids = qw(1621261 89318838 68536103 20807972 730439);
> >>
> >> my $factory = Bio::DB::EUtilities->new(-eutil => 'elink',
> >>                                       -db => 'taxonomy',
> >>                                       -dbfrom => 'protein',
> >>                                       -correspondence => 1,
> >>                                       -id => \@ids);
> >>
> >> # iterate through the LinkSet objects
> >> while (my $ds = $factory->next_LinkSet) {
> >>    $taxa{($ds->get_submitted_ids)[0]} = ($ds->get_ids)[0]
> >> }
> >>
> >> @taxa = @taxa{@ids};
> >>
> >> $factory = Bio::DB::EUtilities->new(-eutil => 'esummary',
> >>        -db    => 'taxonomy',
> >>        -id    => \@taxa );
> >>
> >> while (my $docsum = $factory->next_DocSum) {
> >>    $names{($docsum->get_contents_by_name('TaxId'))[0]} =
> >>        ($docsum->get_contents_by_name('ScientificName'))[0];
> >> }
> >>
> >> foreach (@ids) {
> >>    $idmap{$_} = $names{$taxa{$_}};
> >> }
> >>
> >> # %idmap is
> >> #    1621261 => 'Mycobacterium tuberculosis H37Rv'
> >> #    20807972 => 'Thermoanaerobacter tengcongensis MB4'
> >> #    68536103 => 'Corynebacterium jeikeium K411'
> >> #    730439 => 'Bacillus caldolyticus'
> >> #    89318838 => undef    (this record has been removed from the db)
> >>
> >> 1;
> >>
> >> You probably will need to break up your 30000 into chunks
> >> (say, 1000-3000 each), and do the above on each chunk with a
> >>
> >> sleep 3;
> >>
> >> or so separating the queries.
> >> MAJ
> >> ----- Original Message -----
> >> From: "Bhakti Dwivedi" <bhakti.dwivedi at gmail.com>
> >> To: <bioperl-l at lists.open-bio.org>
> >> Sent: Friday, December 25, 2009 9:46 PM
> >> Subject: [Bioperl-l] how to retrieve organism name from accession
> >> number?
> >>
> >>
> >>> Hi,
> >>>
> >>> Does anyone know how to retrieve the "Source" or the "Species name" given
> >>> the accession number using Bioperl?  I have these 30,000 accession numbers
> >>> for which I need to get the source organisms.  Any kind of help will be
> >>> appreciated.
> >>>
> >>> Thanks
> >>>
> >>> BD
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>
> >>>
> >>




