[Biopython] [Entrez/eFetch] "reasonable" package

Leighton Pritchard Leighton.Pritchard at hutton.ac.uk
Thu Dec 3 09:07:16 UTC 2015


Hi,

As Peter says, it's probably dependent on the database/file format you want to download. I've had success downloading NCBI records in batches, collecting them in a local list, and then treating them as though they were a single returned list of records.

For a recent script I used this approach to batch large result sets:

i) Wrap each Entrez call in a retry function (args.retries comes from argparse; it's lazy not to make it a function argument, and I should fix that).

# Imports and logger shared by the snippets below
import logging
import sys
import traceback

from Bio import Entrez

logger = logging.getLogger(__name__)


# Retry Entrez requests
def entrez_retry(fn, *fnargs, **fnkwargs):
    """Retry the passed function up to args.retries times.

    args.retries is a module-level value from argparse.
    """
    tries, success = 0, False
    while not success and tries < args.retries:
        try:
            output = fn(*fnargs, **fnkwargs)
            success = True
        except Exception:
            tries += 1
            logger.warning("Entrez query %s(%s, %s) failed (%d/%d)",
                           fn, fnargs, fnkwargs, tries, args.retries)
            logger.warning(traceback.format_exc())
    if not success:
        logger.error("Too many Entrez failures (exiting)")
        sys.exit(1)
    return output

ii) Wrap pulling records from NCBI in batches, using the web history:

# Get results from an NCBI web history search, in batches
def entrez_batch_webhistory(record, expected, batchsize, *fnargs, **fnkwargs):
    """Recover Entrez data from a prior NCBI web history search, in
    batches of defined size, using EFetch. Returns all results as a list.

    - record: Entrez web history record (with "WebEnv" and "QueryKey")
    - expected: number of search returns expected
    - batchsize: number of search returns to retrieve per batch
    - *fnargs: positional arguments passed through to EFetch
    - **fnkwargs: keyword arguments passed through to EFetch
    """
    results = []
    for start in range(0, expected, batchsize):
        batch_handle = entrez_retry(Entrez.efetch, *fnargs,
                                    retstart=start, retmax=batchsize,
                                    webenv=record["WebEnv"],
                                    query_key=record["QueryKey"],
                                    **fnkwargs)
        batch_record = Entrez.read(batch_handle)
        results.extend(batch_record)
    return results

iii) Run the complete query, saving the record IDs to the web history (this can identify thousands of records at a time). Here, record has "WebEnv" and "QueryKey" fields that let you recover the results later, and a "Count" field that tells you how many records in total to expect back. In my experience Count caps at 100,000, even when there were more records to return; I have no robust, reliable way to overcome this.

    # Use NCBI history for the search.
    handle = entrez_retry(Entrez.esearch, db="assembly", term=query,
                          format="xml", usehistory="y")
    record = Entrez.read(handle)
    # Recover assembly UIDs from the web history
    asm_ids = entrez_batch_webhistory(record, int(record["Count"]), 250,
                                      db="assembly", retmode="xml")

YMMV, but I hope this is helpful.

Cheers,

L.

On 2 Dec 2015, at 22:04, Peter Cock <p.j.a.cock at googlemail.com> wrote:

Hi,

Currently Biopython does not attempt to do anything about
limiting retmax on your behalf. The suggested retmax limit of 500
is probably specific to that database and/or file format (or so I
would imagine; some records, like uilists, are tiny in comparison).
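
For example, the cap is whatever you pass yourself. A sketch (the database, search term, and email address are all placeholders):

from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI asks for a contact address
handle = Entrez.esearch(db="pubmed", term="example query", retmax=500)
record = Entrez.read(handle)
ids = record["IdList"]  # at most 500 IDs come back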

Are you using the results as XML? It probably is possible to
merge the XML files, but it might be more hassle than it's worth.

I would suggest a double loop ought to work fine: loop over
the collection of XML files, and then for each file loop over the
records returned from the parser.
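
A minimal sketch of that double loop (the file names, email address, and per-record handling are all placeholders; it assumes each file holds one saved batch of EFetch XML whose records Entrez.parse can iterate over; for formats it can't split, calling Entrez.read per file and looping over the returned list amounts to the same thing):

from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI asks for a contact address

# Hypothetical file names for the saved 500-record batches
xml_files = ["efetch_batch_001.xml", "efetch_batch_002.xml"]

for xml_file in xml_files:                   # outer loop: one file per batch
    with open(xml_file, "rb") as handle:
        for record in Entrez.parse(handle):  # inner loop: records in this file
            print(record)                    # replace with real per-record work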

Regards,

Peter

On Wed, Dec 2, 2015 at 9:39 PM, <c.buhtz at posteo.jp> wrote:
I asked Entrez support how I should treat the server's resources
with "respect". :)

Their first answer gave no concrete numbers, but in the second they
told me that asking for 500 (retmax for eSearch) is a "reasonable"
value, because eBot (a tool they offer on their website) uses it too.

Now I have nearly 13,000 PIDs whose article info I want to fetch via
eFetch. It is a lot. ;)

But I am not sure how to do that with Biopython. If I split the
request into batches of 500, I would get back 26 different record
objects. I don't like that; I would prefer one big record object I can analyse.

Do you see a way to merge these record objects, or is there maybe
another way to do this?
Or does Bio.Entrez already handle that problem internally (like the
only-three-queries-per-second rule or the HTTP POST decision)?

Any suggestions?
--
GnuPGP-Key ID 0751A8EC
_______________________________________________
Biopython mailing list  -  Biopython at mailman.open-bio.org
http://mailman.open-bio.org/mailman/listinfo/biopython


--
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e: leighton.pritchard at hutton.ac.uk       w: http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827

