[Bioperl-l] NCBI GenBank web retrieval

Lincoln Stein lstein@cshl.org
Fri, 25 Jan 2002 11:00:48 -0500


Yes, it's possible, but the HTML interface only gives you a limited
number of results per page, so you have to keep going back to it for the
next batch.  Unless, of course, you've got a trick for getting all the
results in one long list, in which case I'd like to know the magic
invocation.  Parsing the HTML is probably even more brittle than the way
we were doing it before, because even a slight change to the page layout
will break things.
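
To illustrate the "keep going back to it" problem, here is a minimal
Python sketch of a paging loop.  It is only a sketch under stated
assumptions: fetch_page and fake_fetch_page are hypothetical stand-ins
for the real HTTP request plus HTML parsing, and the page size of 20 is
made up rather than anything NCBI documents.

```python
from typing import Callable, Iterator, List


def iter_all_results(fetch_page: Callable[[int, int], List[str]],
                     page_size: int = 20) -> Iterator[str]:
    """Yield every hit by requesting successive pages until one comes back short."""
    start = 0
    while True:
        page = fetch_page(start, page_size)
        yield from page
        # A short (or empty) page means the result set is exhausted.
        if len(page) < page_size:
            break
        start += page_size


# Hypothetical stand-in for "submit the query and parse one page of HTML".
def fake_fetch_page(start: int, count: int) -> List[str]:
    hits = [f"ACC{n:05d}" for n in range(45)]
    return hits[start:start + count]


all_hits = list(iter_all_results(fake_fetch_page, page_size=20))
```

The loop itself is simple; the fragile part is whatever sits behind
fetch_page, since that is where the HTML parsing would live.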

Lincoln

On Thursday 24 January 2002 16:10, Josiah Altschuler wrote:
> I wasn't sure why the Boulder module didn't work anymore last week for
> queries, so I put it aside and wrote code to submit queries to
> http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query and just parsed the
> HTML.  This seemed to work fine.  Is it not possible to do this with
> Boulder?
>
> Josiah
>
>
> -----Original Message-----
> From: Lincoln Stein [mailto:lstein@cshl.org]
> Sent: Thursday, January 24, 2002 2:29 PM
> To: Jason Stajich; Bioperl
> Cc: Josiah Altschuler; Baumohl, Jason; pan@cshl.org
> Subject: Re: [Bioperl-l] NCBI GenBank web retrieval
>
>
> Hi All,
>
> I just spent a few hours restoring partial functionality to
> Boulder::Genbank.  I've been able to fix its ability to retrieve a list of
> accession numbers by changing the URI to use the "demo" batch retriever at
> http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/.  This seems to have exactly
> the same API as the old batch retriever (now retired), but adds XML
> support, which is nice.  Unfortunately, the demo retriever doesn't want to
> return any results in response to Entrez queries, so fetch by query
> doesn't work.  Dang.
>
> The new version is uploaded to CPAN.  (I've also added proxy support for
> firewall users.)
>
> The design goal for Boulder::Genbank, you will recall, was to allow either
> retrieving a long list of sequence accessions, or an arbitrary Entrez query
> in Fasta, Genbank, or Boulder (parsed) format.  It got around NCBI's
> download limits by carefully breaking down the requests into small chunks
> and reissuing the requests as needed.  I was able to use this interface to
> fetch all the Rice ESTs (many hundreds of thousands) at regular intervals,
> and didn't have to worry about timeouts and the like.
>
> I would like to know whether the demo batch retriever is stable, or will go
> the same route as the previous batch retriever.  Also, should I just retire
> Boulder::Genbank?  If I do, does Bio::DB::GenBank support these big
> queries, and if so how does it do it?
>
> Lincoln
>
> On Saturday 19 January 2002 17:48, Jason Stajich wrote:
> > [jason having learned way too much about how to reverse engineer CGI]
> >
> > I've restored the functionality from previous versions of DB::GenBank and
> > DB::GenPept as we are using the new NCBI cgi /htbin-post/Entrez/query.
> > I was able to figure out that terms are encoded as being separated by '+'
> > instead of the previous ',' which had been causing only one sequence to
> > be retrieved.  Additionally, I fixed a bug that retrieved the last rather
> > than the first sequence when a get_Seq_by_(id|acc) request has multiple
> > hits.
> >
> > I was unable to reactivate access to Batch entrez through
> > /entrez/batchentrez.cgi as that only seems to return an HTML table and I
> > am trying to avoid the 2-step query process at this time.  I attempted to
> > mimic Lincoln's functionality in Boulder::Genbank here, but alas it
> > appears that the previous /cgi-bin/Entrez/qserver.cgi/result is disabled.
> > Lincoln - I believe this breaks Boulder 1.24 Entrez access as well.  I
> > guess we can go to a 2-step retrieval by parsing HTML if people are
> > interested.
> >
> > Are there limits to the size of URLs?  I thought there might be, which
> > could be a problem since the requests are sent as GETs, not POSTs.
> > Otherwise we basically have batch entrez functionality back in.
> >
> > (Roger this is essentially the fix we talked about - as best as I can
> > solve it so you can take it off your queue unless you've got ideas)
> >
> > -jason

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org                                   Cold Spring Harbor, NY
========================================================================