[Biopython] entire sequence file is unintentionally being loaded

Wed Nov 9 20:18:48 UTC 2016

On Wed, Nov 9, 2016 at 7:07 PM, Martin Mokrejs
<mmokrejs at fold.natur.cuni.cz> wrote:
>
> Python initially expands the items to loop over before it starts the loop
> (so it unpacks the data in memory). You should avoid the for loop or wrap the
> zip() call with itertools.chain() or, even better, use itertools.izip()
> directly.

Yes, izip is probably the right choice here (on Python 2).
Or switch to Python 3 if you can, and use zip.

Or, for dual-platform code, try the module six?
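
A minimal sketch of what lazy pairing looks like, assuming a made-up
generator (numbered) as a stand-in for a large record source. The fallback
import picks izip on Python 2 and the built-in zip on Python 3; with six
installed, "from six.moves import zip" should give the same effect.

```python
try:
    from itertools import izip as lazy_zip  # Python 2
except ImportError:
    lazy_zip = zip  # Python 3: the built-in zip is already lazy


def numbered(label, count, log):
    """Hypothetical stand-in for a large record source."""
    for i in range(count):
        log.append((label, i))  # note each item actually produced
        yield "%s_%d" % (label, i)


log = []
pairs = lazy_zip(numbered("R1", 1000000, log), numbered("R2", 1000000, log))
first = next(pairs)  # pulls exactly one item from each source
```

Only one item has been taken from each source at this point, so nothing
like the full million pairs is ever held in memory at once.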

> Depending on the for loop contents, you could probably rewrite the loop into
> a generator expression (using parentheses rather than square brackets) and
> avoid the need for itertools altogether.

Possibly yes.
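
One thing to watch for here: parentheses give a generator expression, which
is evaluated lazily, whereas square brackets give a list comprehension,
which builds the whole list in memory. A small illustration (the names are
just for this example):

```python
import sys

# Square brackets: a list comprehension, fully built in memory
squares_list = [n * n for n in range(10000)]

# Parentheses: a generator expression, items produced on demand
squares_gen = (n * n for n in range(10000))

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
total = sum(squares_gen)  # consumes the generator one item at a time
```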

>> So my question is: how do I find out what is going on? What have I
>> misunderstood? What is the best way for me to iterate over the index given
>> that I have two indices (R1 and R2) and analyse the reads as a pair? I
>> suspect it is the .values() command where I am going wrong.
>
> Yes, .values() is another problem, here you explicitly ask for expansion of
> the index contents in the memory. Avoid .values() everywhere in your code.

Well, in general yes (under Python 2), but in this specific case it is
probably fine:

With Python dictionaries, under Python 2, .values(), .items() and .keys()
will give you lists expanded in memory - but lightweight, iterable view
objects under Python 3. Under Python 2 you'd use .itervalues(),
.iteritems() or .iterkeys().
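
A quick way to see this for a plain dict, assuming Python 3 (the variable
names are only for illustration): .values() hands back a view onto the
dictionary rather than a copied list, so it even reflects later changes.

```python
d = {"a": 1, "b": 2}
values = d.values()  # Python 3: a view object, not a list copy
d["c"] = 3           # the view sees entries added afterwards
current = sorted(values)
```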

The dictionary-like index objects from Biopython SeqIO will give you
iterators under both Python 2 and Python 3 (because this code was
intended for use on large files, so returning all the contents as a list
of SeqRecord objects would most likely cause memory problems).

(Biopython also provides .itervalues() etc under Python 2 for the
SeqIO dictionary-like index, which act the same.)
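
For the paired R1/R2 case, a rough sketch of one way to do it, assuming
both indexes use the same read identifiers as keys (if your files use
/1 and /2 suffixes you would need to adjust the lookup). The function
name iter_read_pairs and the file names are made up for this example;
it works with any dictionary-like objects.

```python
def iter_read_pairs(index_r1, index_r2):
    """Yield (R1 record, R2 record) one pair at a time.

    Only keys present in both indexes are used, so each record is
    fetched on demand rather than loading everything up front.
    """
    for key in index_r1:
        if key in index_r2:
            yield index_r1[key], index_r2[key]


# Intended usage with Biopython (hypothetical file names):
#
#   from Bio import SeqIO
#   r1 = SeqIO.index("reads_R1.fastq", "fastq")
#   r2 = SeqIO.index("reads_R2.fastq", "fastq")
#   for rec1, rec2 in iter_read_pairs(r1, r2):
#       ...  # analyse the pair here
```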

Peter