[Biopython] how Entrez.parse() internally work

Fri Dec 11 15:44:43 UTC 2015

On Thu, Dec 10, 2015 at 11:10 AM,  <c.buhtz at posteo.jp> wrote:
> On 2015-12-09 21:25 Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> Almost certainly asking for 5 GB like that will fail. You should
>> request much smaller batches of data, by making multiple
>> calls to efetch with an increasing start value.
>
> Exactly that is what I want to prevent when using Entrez.parse() - if
> it is possible. That is what my question is about. ;)
>
> Using parse() is not only about use of my RAM - it is about workload
> for the NCBI-servers.

Yes. I think this is why the NCBI want people to use the history
feature and make multiple calls to efetch to retrieve the data
in batches.
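
As a rough sketch of that batching pattern (untested against your
query): it assumes the search was run through Entrez.esearch with
usehistory="y", so the parsed result holds Count, WebEnv and QueryKey.
The names batch_starts and fetch_in_batches are made up for this
example, and the batch size of 500 is only a placeholder, not an NCBI
recommendation.

```python
def batch_starts(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_in_batches(search_results, db, batch_size=500,
                     rettype="fasta", retmode="text"):
    """Yield the raw text of each batch, one efetch call per batch.

    `search_results` is assumed to be the parsed result of an
    Entrez.esearch(..., usehistory="y") call.
    """
    # Imported here so batch_starts() can be used without Biopython installed.
    from Bio import Entrez

    total = int(search_results["Count"])
    for start, size in batch_starts(total, batch_size):
        handle = Entrez.efetch(
            db=db,
            rettype=rettype,
            retmode=retmode,
            retstart=start,
            retmax=size,
            webenv=search_results["WebEnv"],
            query_key=search_results["QueryKey"],
        )
        try:
            yield handle.read()
        finally:
            handle.close()
```

With a total of 1050 records and a batch size of 500, batch_starts
gives the (retstart, retmax) pairs (0, 500), (500, 500), (1000, 50),
so only the failed batch needs repeating if one efetch call breaks.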

>> > When I call Entrez.efetch(retmax=999999)?
>> > Or is physically/really only one record (some KBytes, not much)
>> > transferred from NCBI to me during each iteration (or next())?
>>
>> It should be a few Kbytes at a time as each record is parsed.
>
> Nice, then I see no need to separate my requests on NCBI because
> parse() does this for me when I iterate with it.
>
> Maybe I misunderstand it? ;)

I don't know enough about how the NCBI implements this
(and it is probably different for each database as Entrez does
seem to connect to different back-end systems).

The problem is the more data you request in a single efetch call,
the more likely it is to fail (due to network issues or even a timeout
at the NCBI).

Making many small requests is also inefficient. There is a middle
ground of batch size - please email the NCBI about your specific
request and ask what size batches you should use with efetch,
i.e. what retmax value.

Peter