[Biopython] entire sequence file is unintentionally being loaded

Liam Thompson dejmail at gmail.com
Wed Nov 9 20:29:05 UTC 2016


Hi Peter

Apologies for the inadequate description, but you understood the gist of it.

Thank you for the suggestions. You were right about zip(): I was unaware
that under Python 2 it builds the whole list in memory, which defeats the
memory-cautious iterators. Switching to itertools.izip seems to have
sorted things out as suggested, although now I need to spend some time
speeding the whole script up.
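
A minimal sketch of why the lazy pairing helps. The counting_reads()
generator and the "R1"/"R2" labels are hypothetical stand-ins for real
record iterators, not part of the original script:

```python
try:
    from itertools import izip  # Python 2: lazy pairing
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy


def counting_reads(label, n, log):
    """Hypothetical stand-in for a lazy record iterator (e.g. a parser)."""
    for i in range(n):
        log.append(label)  # note each time a record is actually produced
        yield "%s_read%d" % (label, i)


log = []
pairs = izip(counting_reads("R1", 1000, log), counting_reads("R2", 1000, log))

first = next(pairs)  # pulls exactly one record from each side
produced_after_first = len(log)  # stays small: nothing was expanded up front
```

With Python 2's built-in zip() in place of izip, all 2000 records would be
produced before the first pair became available.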

I will try .itervalues() again as well. I did try it earlier, but it
also complained, perhaps for different reasons. I will investigate and
report back.
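
For plain dictionaries, a small helper along these lines might avoid the
Python 2 list expansion while still running on Python 3. The name
lazy_values is made up for illustration; the six module offers a similar
six.itervalues() helper:

```python
def lazy_values(mapping):
    """Return an iterator over the values without building a list."""
    if hasattr(mapping, "itervalues"):
        # Python 2 dict (or any object providing .itervalues())
        return mapping.itervalues()
    # Python 3: .values() is a lightweight view, so iterating it is lazy
    return iter(mapping.values())


counts = {"read1": 3, "read2": 5}
total = sum(lazy_values(counts))
```

This only covers the dictionary side; whether a given dictionary-like
object provides .itervalues() is something to check case by case.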


Liam




On 2016-11-09 21:18, Peter Cock wrote:
> On Wed, Nov 9, 2016 at 7:07 PM, Martin Mokrejs
> <mmokrejs at fold.natur.cuni.cz> wrote:
>>
>> Under Python 2, zip() builds the complete list of pairs before the loop
>> starts (so it unpacks the data in memory). You should avoid calling zip()
>> in the for loop and use itertools.izip() directly instead.
>
> Yes, izip is probably the right choice here (on Python 2).
> Or switch to Python 3 if you can, and use zip.
>
> Or, for dual-platform code, try the module six?
>
>> Depending on the for loop contents, you could probably rewrite the loop into
>> a generator expression (using parentheses, not square brackets, which would
>> build a full list) and avoid the need for itertools altogether.
>
> Possibly yes.
>
>>> So my question is: how do I find out what is going on? What have I
>>> misunderstood? What is the best way for me to iterate over the index, given
>>> that I have two indices (R1 and R2) and analyse the reads as a pair? I
>>> suspect it is the .values() call where I am going wrong.
>>
>> Yes, .values() is another problem, here you explicitly ask for expansion of
>> the index contents in the memory. Avoid .values() everywhere in your code.
>
> Well, in general yes (under Python 2), but in this specific case it is
> probably fine:
>
> With Python dictionaries, under Python 2, .values(), .items() and .keys()
> will give you lists expanded in memory - but lazy views under Python 3.
> Under Python 2 you'd use .itervalues(), .iteritems() or .iterkeys().
>
> The dictionary-like index objects from Biopython SeqIO will give you
> iterators under both Python 2 and Python 3 (because this code was
> intended for use on large files, so returning all the contents as a list
> of SeqRecord objects would most likely cause memory problems).
>
> (Biopython also provides .itervalues() etc under Python 2 for the
> SeqIO dictionary-like index, which act the same.)
>
> Peter
>
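
For the paired-read question quoted above, one possible approach is to
walk the keys of one index and look up the mate in the other, so that only
one pair is held at a time. This is a sketch, assuming both indexes use
identical keys for mates; iter_read_pairs and the file names in the
comment are hypothetical:

```python
def iter_read_pairs(r1_index, r2_index):
    """Yield (R1, R2) record pairs one at a time, matched by shared key."""
    for key in r1_index:  # iterating a mapping yields its keys only
        if key in r2_index:  # skip reads whose mate is missing
            yield r1_index[key], r2_index[key]


# Plain dicts stand in for the indexes here; with Biopython one might pass
# SeqIO.index("R1.fastq", "fastq") and SeqIO.index("R2.fastq", "fastq").
r1 = {"read1": "ACGT", "read2": "GGCC", "orphan": "TTTT"}
r2 = {"read1": "TGCA", "read2": "CCGG"}
pairs = sorted(iter_read_pairs(r1, r2))
```

If the read identifiers carry /1 and /2 suffixes, the keys would need to be
normalised first (SeqIO.index accepts a key_function argument for this).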

