[Bioperl-l] NCBI GenBank web retrieval

Jason Stajich jason@cgt.mc.duke.edu
Thu, 24 Jan 2002 17:22:28 -0500 (EST)


FYI:
That is exactly what Bio::DB::GenBank and Bio::DB::GenPept do, via the calls

use Bio::DB::GenBank;

my $db  = Bio::DB::GenBank->new();
my $seq = $db->get_Seq_by_acc($accession);

or

my $seqio = $db->get_Stream_by_acc([$acc1,$acc2]);
while( my $seq = $seqio->next_seq ) {
    # process each Bio::Seq object
}
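
Neither module chunks a really long accession list for you (as far as I
know), so for requests on the scale Lincoln describes below you would want
to batch the stream calls yourself.  A rough sketch, assuming @accessions
already holds the list and using an arbitrary chunk size of 100:

use Bio::DB::GenBank;
use Bio::SeqIO;

my $db  = Bio::DB::GenBank->new();
my $out = Bio::SeqIO->new(-format => 'genbank');   # writes to STDOUT by default

while ( my @chunk = splice(@accessions, 0, 100) ) {
    my $stream = $db->get_Stream_by_acc(\@chunk);
    while ( my $seq = $stream->next_seq ) {
        $out->write_seq($seq);
    }
}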

On Thu, 24 Jan 2002, Josiah Altschuler wrote:

> I wasn't sure why the Boulder module stopped working for queries last
> week, so I put it aside and wrote code that submits queries to
> http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query and just parses the
> HTML.  This seemed to work fine.  Is it not possible to do this with
> Boulder?
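>
> For reference, the kind of request I mean looks roughly like this (the
> query parameters below are placeholders, not necessarily the exact names
> the CGI expects):
>
>   use LWP::Simple qw(get);
>
>   my $base = 'http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query';
>   my $html = get("$base?db=nucleotide&term=my+search+terms");
>   # then pull the hit list out of the returned HTML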
>
> Josiah
>
>
> -----Original Message-----
> From: Lincoln Stein [mailto:lstein@cshl.org]
> Sent: Thursday, January 24, 2002 2:29 PM
> To: Jason Stajich; Bioperl
> Cc: Josiah Altschuler; Baumohl, Jason; pan@cshl.org
> Subject: Re: [Bioperl-l] NCBI GenBank web retrieval
>
>
> Hi All,
>
> I just spent a few hours restoring partial functionality to
> Boulder::Genbank.  I've been able to fix its ability to retrieve a list of
> accession numbers by changing the URI to use the "demo" batch retriever at
> http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/.  This seems to have exactly
> the same API as the old batch retriever (now retired), but adds XML
> support, which is nice.  Unfortunately, the demo retriever doesn't want to
> return any results in response to Entrez queries, so fetch by query
> doesn't work.  Dang.
>
> The new version is uploaded to CPAN.  (I've also added proxy support for
> firewall users.)
>
> The design goal for Boulder::Genbank, you will recall, was to allow
> retrieval of either a long list of sequence accessions or the results of
> an arbitrary Entrez query, in Fasta, Genbank, or Boulder (parsed) format.
> It got around NCBI's download limits by carefully breaking the requests
> down into small chunks and reissuing them as needed.  I was able to use
> this interface to fetch all the Rice ESTs (many hundreds of thousands) at
> regular intervals, and didn't have to worry about timeouts and the like.
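>
> For anyone who hasn't used it, the accession-list side of the interface
> looks roughly like this (simplified and from memory, so check the POD for
> the exact options and tag names):
>
>   use Boulder::Genbank;
>
>   # @accessions holds GenBank accession strings
>   my $gb = Boulder::Genbank->new( -accessor => 'Entrez',
>                                   -fetch    => \@accessions );
>   while ( my $stone = $gb->get ) {
>       print $stone->Accession, "\n";   # Stone tag accessor
>   }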
>
> I would like to know whether the demo batch retriever is stable, or
> whether it will go the same route as the previous batch retriever.  Also,
> should I just retire Boulder::Genbank?  If I do, does Bio::DB::GenBank
> support these big queries, and if so, how does it do it?
>
> Lincoln
>
> On Saturday 19 January 2002 17:48, Jason Stajich wrote:
> > [jason having learned way too much about how to reverse engineer CGI]
> >
> > I've restored the functionality from previous versions of DB::GenBank and
> > DB::GenPept now that we are using the new NCBI cgi /htbin-post/Entrez/query.
> > I was able to figure out that terms are encoded separated by '+' instead
> > of the previous ',', which had been causing only one sequence to be
> > retrieved.  Additionally, I fixed a bug where a request with multiple
> > hits via get_Seq_by_(id|acc) retrieved the last rather than the first
> > sequence.
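> >
> > In other words, when building the query part of the URL the change boils
> > down to something like
> >
> >   my $term_string = join('+', @terms);   # was join(',', @terms)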
> >
> > I was unable to reactivate access to Batch Entrez through
> > /entrez/batchentrez.cgi, as that only seems to return an HTML table, and I
> > am trying to avoid the 2-step query process at this time.  I attempted to
> > mimic Lincoln's functionality in Boulder::Genbank here, but alas it
> > appears that the previous /cgi-bin/Entrez/qserver.cgi/result is disabled.
> > Lincoln - I believe this breaks Boulder 1.24 Entrez access as well.  I
> > guess we can go to a 2-step retrieval by parsing HTML if people are
> > interested.
> >
> > Are there limits to the size of URLs?  I thought there might be, which
> > could be a problem since the requests are sent as GETs, not POSTs.
> > Otherwise we basically have Batch Entrez functionality back in.
> >
> > (Roger, this is essentially the fix we talked about - as best as I can
> > solve it - so you can take it off your queue unless you've got ideas.)
> >
> > -jason
>
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu