[Biopython] how Entrez.parse() internally work

Peter Cock p.j.a.cock at googlemail.com
Wed Dec 9 13:23:11 UTC 2015


On Wed, Dec 9, 2015 at 7:53 AM,  <c.buhtz at posteo.jp> wrote:
> Can someone please explain how Entrez.parse() really works internally,
> from the viewpoint of the NCBI servers?
>
> e.g.
> # returning e.g. 100,000 records
> handle = Entrez.efetch(retstart=0, retmax=999999, ...)
> record = Entrez.parse(handle)
>
> In that case "record" doesn't hold all records in RAM at the same time.
> Holding them all in RAM would cause an out-of-memory error on my small
> system.

Calling that variable "record" is confusing, because it is an iterator
and not a single record. Something like this would be clearer:

record_iterator = Entrez.parse(handle)

At this point the parsing has not started. When you use the
iterator in a loop, then parsing begins.

for record in record_iterator:
    # one record has been parsed from the handle and into RAM
    pass  # process the record here

Only one record at a time is loaded - the loop keeps going
until the end of the handle is reached (could be end of the
file, or end of the data sent by the NCBI in this case).
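
To make the lazy behaviour concrete, here is a minimal sketch using only
the standard library. It is not Biopython's implementation: parse_records
is a hypothetical stand-in for Entrez.parse, the blank-line-separated
record format is made up, and an io.StringIO object plays the role of the
handle returned by the NCBI. It only illustrates that nothing is read
until the iterator is advanced.

```python
import io


def parse_records(handle):
    """Hypothetical lazy parser: yield one blank-line-separated record at a time."""
    lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line:
            lines.append(line)
        elif lines:
            yield lines  # hand back one record, then pause until asked again
            lines = []
    if lines:
        yield lines  # last record if the data does not end with a blank line


# Made-up data standing in for the stream sent back by the server.
handle = io.StringIO("id: 1\nname: a\n\nid: 2\nname: b\n\nid: 3\nname: c\n")

record_iterator = parse_records(handle)
position_before = handle.tell()       # nothing has been read from the handle yet

first = next(record_iterator)         # reads just enough to build one record
position_after_first = handle.tell()  # only part of the data has been consumed

remaining = list(record_iterator)     # now the rest is read, record by record
```

Checking handle.tell() before and after the first next() call shows that
the data is consumed gradually rather than all at once.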

Assuming Python 3, you can also use the built-in next() function:

record_iterator = Entrez.parse(handle)
record = next(record_iterator)  # first record parsed
record = next(record_iterator)  # second record
record = next(record_iterator)  # third record

But a loop would be the normal way to do this.
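
As a rough illustration of why the loop and next() are equivalent, the
sketch below does by hand what a for loop does for you. The names consume
and the list of strings are made up for the example; a plain iterator over
a list stands in for Entrez.parse(handle).

```python
def consume(iterator):
    """Do by hand what a for loop does: call next() until StopIteration."""
    seen = []
    while True:
        try:
            record = next(iterator)
        except StopIteration:
            break  # the iterator is exhausted; a for loop stops here too
        seen.append(record)
    return seen


# A plain iterator over a list stands in for Entrez.parse(handle).
record_iterator = iter(["record 1", "record 2", "record 3"])
by_hand = consume(record_iterator)

# next() also accepts a default, returned instead of raising StopIteration.
leftover = next(record_iterator, None)
```

Once the iterator is exhausted it stays exhausted, so to go through the
records again you would need a fresh handle and a fresh iterator.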

> The question is:
> Where are all these records (physically)?
> And how often is the NCBI server actually queried (with eFetch), and
> with which values of retmax and retstart?

There is one call to Entrez.efetch, using the retstart and retmax
values given. The NCBI will return a stream of data (like a file
handle) containing one record after another.

You could write all this to a file (and parse it later). If you use
the Entrez.parse(handle) approach with a loop, then the data
is taken from the handle gradually: each record is parsed, the
body of the for loop runs on it, and only then is the next record read.
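
The "write it to a file and parse it later" option could look roughly
like this. It is only a sketch with assumptions: an io.BytesIO object with
made-up content stands in for the handle returned by Entrez.efetch, and
the filename efetch_results.xml is hypothetical. With real data you would
pass the reopened file to Entrez.parse rather than counting lines.

```python
import io
import os
import shutil
import tempfile

# Stand-in for the handle returned by Entrez.efetch(...); the content is made up.
handle = io.BytesIO(b"<Record>1</Record>\n<Record>2</Record>\n<Record>3</Record>\n")

# Hypothetical output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "efetch_results.xml")

# Copy the stream to disk in fixed-size chunks, so memory use stays small
# however much data the handle delivers.
with open(path, "wb") as out:
    shutil.copyfileobj(handle, out, 16 * 1024)

# Later: reopen the file and read it lazily, one line at a time.
with open(path, "rb") as saved:
    line_count = sum(1 for _ in saved)
```

Either way the data only travels over the network once; the file just
moves the parsing step to a later time.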

The earlier parts of the Biopython tutorial look at this with
the SeqIO.parse(...) iterator and files or handles. See also
general introductions/tutorials to Python iterators.

Peter

