[Biopython] entire sequence file is unintentionally being loaded

Martin Mokrejs mmokrejs at fold.natur.cuni.cz
Wed Nov 9 19:07:48 UTC 2016


Hi Liam,

Liam Thompson wrote:
> Hi everyone
>
> I have written a demultiplexing script for an Illumina NGS library, where I analyse each pair sequence, find the barcode-primer from a dictionary, and assign the reads to a sample file. I'm using python2.7 for compatibility reasons on a Linux machine, and the most recent biopython.
>
> Obviously, I don't want to load the entire sequence file into memory which is what I have tried to avoid by indexing the reads with biopy first which Peter helped with on a previous email.
>
> So I take the index dictionary like object I receive from the index function and merge the values with zip so that I have the paired reads information in one tuple.
>
>     for r1, r2 in zip(self.R1.values(), self.R2.values()):
>         pair_seq_dict = {'r1' : r1, 'r2' : r2}
>
> I thought fetching the R1 and R2 values like this would essentially continuously query the index until the index has run out of values to return. I've obviously missed something or am implementing it wrong.

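In Python 2, zip() builds the complete list of tuples before the loop starts, so everything it pulls from its arguments sits in memory at once. The for loop itself is not the problem. Replace zip() with itertools.izip(), which yields one pair at a time. Wrapping the zip() call in itertools.chain() would not help: the list is already built by then, and chain() concatenates iterables rather than pairing them.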
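
To illustrate the difference, here is a minimal sketch of lazy pairing. The fake_reads() generator and its log list are hypothetical stand-ins for iterating over the two read files; they are not part of Biopython. The import fallback is there only so the same snippet also runs on Python 3, where the built-in zip() is already lazy.

```python
try:
    from itertools import izip  # Python 2: lazy counterpart of zip()
except ImportError:
    izip = zip  # Python 3: the built-in zip() already yields pairs lazily


def fake_reads(prefix, count, log):
    """Hypothetical stand-in for an iterator over the reads of one file."""
    for i in range(count):
        name = prefix + str(i)
        log.append(name)  # record the moment each read is actually produced
        yield name


log = []
pairs = izip(fake_reads("r1_", 3, log), fake_reads("r2_", 3, log))

# Nothing has been read yet; pulling one pair reads one record from each side.
first_pair = next(pairs)
```

After next(pairs), log should hold only the first read from each side, which is the behaviour you want for large files.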

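Depending on the for loop contents, you could probably rewrite the loop as a generator expression. Note that a generator expression uses parentheses; square brackets give a list comprehension, which builds the whole list in memory. A generator expression is only as lazy as the iterables it consumes, so on Python 2 it still needs izip() rather than zip() inside it.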
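
A small sketch of the generator-expression form, assuming the loop body only needs to build the pair_seq_dict from your snippet. The two short lists of strings are made-up placeholders for the real read iterators.

```python
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3

# Placeholder data standing in for the R1 and R2 read iterators.
r1_reads = iter(["ACGT", "TTGA", "GGCC"])
r2_reads = iter(["TGCA", "AAGT", "CCGG"])

# Parentheses make this a generator expression: no dict is built until asked for.
pair_dicts = ({'r1': r1, 'r2': r2} for r1, r2 in izip(r1_reads, r2_reads))

first = next(pair_dicts)                 # builds exactly one dict
remaining = sum(1 for _ in pair_dicts)   # consumes the rest one at a time
```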

>
> I have checked the output log where I log the output of the values in the code, and the entire file is not read into memory. Or at least that is what displaying the variables contents says. They only ever seem to have just the R1 and R2 equivalent Seq objects (so two sequences worth of info).
>
> So my question is: how do I find out what is going on? What have I misunderstood? What is the best way for me to iterate over the index, given that I have two indices (R1 and R2), and analyse the reads as a pair? I suspect it is the .values() command where I am going wrong.

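Yes, .values() is another problem. On a Python 2 dictionary it returns a complete list, so here you explicitly ask for the index contents to be expanded in memory. Avoid .values() everywhere in your code: use .itervalues() where the object provides it, or iterate over the keys and look up each record only when you need it.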
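
One way to do the key-based lookup is sketched below. iter_pairs() is a hypothetical helper, not a Biopython function, and it assumes that both indices use identical record IDs as keys, which you should check for your files. Plain dicts stand in for the two index objects so the snippet runs on its own.

```python
def iter_pairs(index_r1, index_r2):
    """Yield (r1, r2) one pair at a time, fetching each record by key.

    Assumes every key of index_r1 is also a key of index_r2.
    """
    for key in index_r1:  # iterating a mapping yields its keys
        yield index_r1[key], index_r2[key]


# Stand-ins for the two indices; in the real script these would be the
# dictionary-like objects you already get from the index function.
R1 = {"read1": "ACGT", "read2": "TTGA"}
R2 = {"read1": "TGCA", "read2": "AAGT"}

pairs = sorted(iter_pairs(R1, R2))
```

Only the keys are walked up front; each record is fetched at the moment its pair is requested.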

Hope this helps,
Martin

