[Bioperl-l] NCBI GenBank web retrieval

Lincoln Stein lstein@cshl.org
Thu, 24 Jan 2002 14:29:08 -0500


Hi All,

I just spent a few hours restoring partial functionality to Boulder::Genbank.
I've been able to fix its ability to retrieve a list of accession numbers by 
changing the URI to use the "demo" batch retriever at 
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/.  This seems to have exactly the 
same API as the old batch retriever (now retired), but adds XML support, 
which is nice.  Unfortunately, the demo retriever doesn't want to return any 
results in response to Entrez queries, so fetch by query doesn't work.  Dang.

The new version is uploaded to CPAN.  (I've also added proxy support for 
firewall users.)

The design goal for Boulder::Genbank, you will recall, was to allow retrieving 
either a long list of sequence accessions or the results of an arbitrary Entrez 
query, in Fasta, Genbank, or Boulder (parsed) format.  It got around NCBI's 
download limits by carefully breaking the requests down into small chunks and 
reissuing them as needed.  I was able to use this interface to fetch all the 
Rice ESTs (many hundreds of thousands) at regular intervals, and didn't have 
to worry about timeouts and the like.
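
To make the chunk-and-reissue idea above concrete, here is a rough sketch.  It 
is written in Python for illustration only and is not the Perl code in 
Boulder::Genbank.  The names chunked, fetch_all and fetch_batch, the batch 
size, and the retry count are all assumptions; fetch_batch stands in for 
whatever call actually talks to the server.

```python
import time
from typing import Callable, Iterable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield successive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all(
    accessions: Iterable[str],
    fetch_batch: Callable[[List[str]], List[str]],  # hypothetical fetcher
    batch_size: int = 100,   # assumed value, not a documented server limit
    max_tries: int = 3,      # assumed retry budget per batch
    pause: float = 0.0,      # seconds to wait before reissuing a batch
) -> Iterator[str]:
    """Fetch every accession in small batches, reissuing a batch that fails."""
    for batch in chunked(list(accessions), batch_size):
        for attempt in range(1, max_tries + 1):
            try:
                # Collect the whole batch before yielding anything, so a
                # retry never hands the caller duplicate records.
                records = fetch_batch(batch)
                break
            except OSError:
                if attempt == max_tries:
                    raise
                time.sleep(pause)
        yield from records
```

Because each batch is small and is retried on its own, one timeout costs only 
that batch rather than the whole download.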

I would like to know whether the demo batch retriever is stable, or will go 
the same route as the previous batch retriever.  Also, should I just retire 
Boulder::Genbank?  If I do, does Bio::DB::GenBank support these big queries, 
and if so how does it do it?

Lincoln

On Saturday 19 January 2002 17:48, Jason Stajich wrote:
> [jason having learned way too much about how to reverse engineer CGI]
>
> I've restored the functionality from previous versions of DB::GenBank and
> DB::GenPept as we are using the new NCBI cgi /htbin-post/Entrez/query.
> I was able to figure out that terms are encoded as being separated by '+'
> instead of the previous ',' which had been causing only one sequence to
> be retrieved.  Additionally I fixed a bug that retrieved the last rather
> than the first sequence for a request that has multiple hits and use
> get_Seq_by_(id|acc)
>
> I was unable to reactivate access to Batch entrez through
> /entrez/batchentrez.cgi as that only seems to return an HTML table and I
> am trying to avoid the 2-step query process at this time.  I attempted to
> mimic Lincoln's functionality in Boulder::Genbank here, but alas it
> appears that the previous /cgi-bin/Entrez/qserver.cgi/result is disabled.
> Lincoln - I believe this breaks Boulder 1.24 Entrez access as well.  I
> guess we can go to a 2-step retrieval by parsing HTML if people are
> interested.
>
> Are there limits to size of URLs ?  I thought there might be which could
> be a problem since the requests are sent as GETs not POSTs.  Otherwise we
> basically have batch entrez functionality back in.
>
> (Roger this is essentially the fix we talked about - as best as I can
> solve it so you can take it off your queue unless you've got ideas)
>
> -jason

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org                                   Cold Spring Harbor, NY
========================================================================