[Biopython-dev] Bio.Entrez batched downloads

Fri Nov 28 08:05:38 EST 2008

This is returning to a topic we've discussed in the past - the NCBI
Entrez API is quite low level, and the Bio.Entrez module reflects
this.  As a result certain "typical" tasks require more code than one
might expect, in particular batched downloads of a large result set.

The tutorial covers using Bio.Entrez.efetch in a loop to download a
result set in batches, for example writing out a MedLine or FASTA
format file.  This seems like a common need - starting either from a
list of IDs, or better from a history webenv and query_key.  I think
there is a use for a Bio.Entrez.batched_efetch or download_many
function to save people re-implementing their own batched downloader
(even just as a copy and paste from the tutorial).
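
As a rough sketch of what such a function might look like (the name
batched_efetch, its signature and the default batch size are all
hypothetical, not an existing Bio.Entrez API; the fetch function is
passed in so the loop can be tried without network access):

```python
def batched_efetch(fetch, count, out_handle, batch_size=100, **keywds):
    """Download count records in batches, writing each batch to out_handle.

    Hypothetical helper. fetch is any callable accepting retstart and
    retmax keyword arguments and returning a readable handle, which is
    how Bio.Entrez.efetch is used in the tutorial's loop. Any other
    keyword arguments (db, rettype, webenv, query_key, ...) are passed
    through unchanged.
    """
    for start in range(0, count, batch_size):
        # The final batch may be smaller than batch_size.
        retmax = min(batch_size, count - start)
        handle = fetch(retstart=start, retmax=retmax, **keywds)
        try:
            out_handle.write(handle.read())
        finally:
            handle.close()
```

With Biopython this might be called as batched_efetch(Entrez.efetch,
count, out_handle, db="nucleotide", rettype="fasta", webenv=webenv,
query_key=query_key), with count, webenv and query_key taken from an
earlier esearch using the history.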

If the NCBI ever gives any explicit guidance on batch sizes then we
can update Biopython centrally, rather than individual scripts
requiring changes everywhere.  We might also be able to include some
basic error checking (e.g. for empty or partial downloads). One catch
is that downloading and concatenating batches as XML files does not
give a valid XML file - but this is safe for MedLine, FASTA, GenBank
etc.  This proposed function could raise an exception if used with XML
to avoid this issue.
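
To illustrate the kind of checks I have in mind, here is a minimal
sketch.  Both function names are hypothetical, and counting ">" lines
is only an assumed way of spotting a partial FASTA batch; other
formats would need their own record marker.

```python
def reject_xml(rettype=None, retmode=None):
    """Raise ValueError if the caller asks for XML output.

    Concatenated XML batches do not form a valid XML file, so a
    batched downloader could simply refuse (hypothetical policy).
    """
    if "xml" in (str(rettype).lower(), str(retmode).lower()):
        raise ValueError("Batched downloads cannot be concatenated as XML")


def check_fasta_batch(data, expected):
    """Check a FASTA batch is non-empty and has the expected record count."""
    if not data.strip():
        raise IOError("Empty batch returned")
    # Each FASTA record starts with a line beginning with ">".
    found = sum(1 for line in data.splitlines() if line.startswith(">"))
    if found != expected:
        raise IOError("Expected %i records, found %i" % (expected, found))
    return found
```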

In terms of the API for getting the data back, there are several options:
* Take an output handle as an argument (which would be written to as
each batch was downloaded)
* Return a handle - the implementation would be a bit more complicated
as we should avoid holding everything in memory, but would then be
very similar to the existing Bio.Entrez.efetch function in its usage.
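
For the second option, a minimal sketch of a handle which only fetches
the next batch when the caller has used up the previous one.  The
class name BatchedHandle and its arguments are hypothetical; as with
the tutorial's loop, fetch is assumed to accept retstart and retmax
and return a readable handle of text.

```python
class BatchedHandle:
    """Hypothetical read-only handle that downloads batches lazily."""

    def __init__(self, fetch, count, batch_size=100, **keywds):
        self._fetch = fetch
        self._count = count
        self._batch_size = batch_size
        self._keywds = keywds
        self._start = 0    # offset of the next batch to request
        self._buffer = ""  # text downloaded but not yet read

    def _fill(self):
        """Append the next batch to the buffer; return False when done."""
        if self._start >= self._count:
            return False
        retmax = min(self._batch_size, self._count - self._start)
        handle = self._fetch(retstart=self._start, retmax=retmax,
                             **self._keywds)
        try:
            self._buffer += handle.read()
        finally:
            handle.close()
        self._start += retmax
        return True

    def read(self, size=-1):
        if size is None or size < 0:
            # Read everything that is left (the caller accepts the memory cost).
            while self._fill():
                pass
            data, self._buffer = self._buffer, ""
            return data
        while len(self._buffer) < size and self._fill():
            pass
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def readline(self):
        while "\n" not in self._buffer and self._fill():
            pass
        end = self._buffer.find("\n") + 1
        if end == 0:
            # No newline left, so return whatever remains (possibly "").
            end = len(self._buffer)
        line, self._buffer = self._buffer[:end], self._buffer[end:]
        return line

    def __iter__(self):
        # Iterate line by line until readline returns an empty string.
        return iter(self.readline, "")
```

Reading line by line (or iterating) then keeps at most about one batch
in memory, and a parser expecting a handle should not need to know the
data arrived in several requests.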

Other options which I don't like:
* Take an output filename (less flexible than just taking an output handle)
* Return the data as a string (memory concerns with large downloads)

Note that related functions like the deprecated
Bio.PubMed.download_many (and early versions of
Bio.GenBank.download_many) used a complicated function callback
mechanism (which required knowing the file format in advance and
having a parser for it).  This doesn't seem sensible for a generic
function.  Currently Bio.GenBank.download_many (obsolete, soon to be
deprecated) just makes a single call to Bio.Entrez.efetch, regardless
of the number of records / amount of data expected.

Peter