[Biopython-dev] Bio.Entrez batched downloads

Sat Nov 29 00:13:33 EST 2008

Sorry, but I am -1 on this. This sounds like software bloat to me.
The NCBI Entrez API is low level because NCBI cannot predict how users will want to use Entrez. We at Biopython know little more than NCBI does, except that our users want to access NCBI Entrez via Python, so we provide a Python interface to it.

Also, I don't think the current situation is unsatisfactory. The Bio.Entrez API is extremely simple, and with an example in the tutorial it should be very easy to use; I don't see a problem with copying and pasting from the tutorial, provided that sufficient information is available there.

--Michiel.

--- On Fri, 11/28/08, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: [Biopython-dev] Bio.Entrez batched downloads
> To: "BioPython-Dev Mailing List" <biopython-dev at lists.open-bio.org>
> Date: Friday, November 28, 2008, 8:05 AM
> This is returning to a topic we've discussed in the past - the NCBI
> Entrez API is quite low level, and the Bio.Entrez module reflects
> this.  As a result certain "typical" tasks require more code than one
> might expect.  In particular, batched downloads of a large result set.
> 
> The tutorial covers using Bio.Entrez.efetch in a loop to download a
> result set in a batch, for example writing out a MedLine or FASTA
> format file.  This seems like a common need - starting either from a
> list of IDs, or better from a history webenv and query_key.  I think
> there is a use for a Bio.Entrez.batched_efetch or download_many
> function to save people re-implementing their own batched downloader
> (even just as a copy and paste from the tutorial).
> 
> If the NCBI ever give any explicit guidance on batch sizes then we
> can update Biopython centrally - rather than individual scripts
> requiring changes everywhere.  We might also be able to include some
> basic error checking too (e.g. for empty or partial downloads).  One
> catch is that downloading and concatenating batches as XML files does
> not give a valid XML file - but this is safe for MedLine, FASTA,
> GenBank etc.  This proposed function could raise an exception if used
> with XML to avoid this issue.
> 
> In terms of the API for getting the data back, there are several options:
> * Take an output handle as an argument (which would be written to as
>   each batch was downloaded)
> * Return a handle - the implementation would be a bit more complicated
>   as we should avoid holding everything in memory, but would then be
>   very similar to the existing Bio.Entrez.efetch function in its usage.
> 
> Other options which I don't like:
> * Take an output filename (less flexible than just taking an output handle)
> * Return the data as a string (memory concerns with large downloads)
> 
> Note that related functions like the deprecated
> Bio.PubMed.download_many (and early versions of
> Bio.GenBank.download_many) used a complicated function call back
> mechanism (which required knowing the file format in advance and
> having a parser for it).  This doesn't seem sensible for a generic
> function.  Currently Bio.GenBank.download_many (obsolete, soon to be
> deprecated) just makes a single call to Bio.Entrez.efetch, regardless
> of the number of records / amount of data expected.
> 
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
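
For reference, the "take an output handle" option from the quoted proposal could look roughly like the sketch below. This is only an illustration of the idea under discussion, not code that exists in Biopython: the name batched_efetch, its signature, and the default batch size of 100 are all assumptions. The fetch argument is expected to behave like Bio.Entrez.efetch, i.e. accept retstart and retmax keywords and return a handle; it is passed in explicitly so that the loop can be tried without network access.

```python
def batched_efetch(fetch, out_handle, count, batch_size=100, **fetch_args):
    """Hypothetical helper: download `count` records in batches.

    `fetch` is any callable with the calling convention of
    Bio.Entrez.efetch (keyword arguments in, handle out).  The batch
    size default is an arbitrary assumption, not NCBI guidance.
    """
    # Concatenated XML batches do not form a valid XML file, so refuse
    # XML up front, as suggested in the quoted proposal.
    if "xml" in (fetch_args.get("retmode"), fetch_args.get("rettype")):
        raise ValueError("concatenated XML batches are not valid XML")

    for start in range(0, count, batch_size):
        # Ask only for the records that remain in the final batch.
        size = min(batch_size, count - start)
        handle = fetch(retstart=start, retmax=size, **fetch_args)
        try:
            data = handle.read()
        finally:
            handle.close()
        # Very basic error check: an empty batch means something went wrong.
        if not data.strip():
            raise IOError("empty batch starting at record %i" % start)
        # Write each batch as it arrives, so nothing accumulates in memory.
        out_handle.write(data)
```

With a history session this might be called as batched_efetch(Entrez.efetch, out_handle, count, db="pubmed", rettype="medline", retmode="text", webenv=webenv, query_key=query_key), where count, webenv and query_key come from an earlier search. The loop itself is essentially what the tutorial example already does.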