[Biopython] entire sequence file is unintentionally being loaded

Wed Nov 9 18:42:53 UTC 2016

Hi everyone

I have written a demultiplexing script for an Illumina NGS library, 
where I analyse each read pair, look up the barcode-primer in a 
dictionary, and assign the reads to a sample file. I'm using Python 2.7 
for compatibility reasons on a Linux machine, and the most recent Biopython.

Obviously, I don't want to load the entire sequence file into memory. 
I have tried to avoid this by first indexing the reads with Biopython, 
which Peter helped me with in a previous email.

So I take the dictionary-like index objects returned by the index 
function and combine their values with zip, so that I have the 
paired-read information in one tuple.

     for r1, r2 in zip(self.R1.values(), self.R2.values()):
         pair_seq_dict = {'r1' : r1, 'r2' : r2}

I thought fetching the R1 and R2 values like this would query the 
index one record at a time until it had run out of values to return. 
I've obviously missed something or am implementing it wrong.

I have checked the output log, where I log the values of the loop 
variables, and it does not show the entire file being read into memory. 
Or at least that is what displaying the variables' contents suggests: 
they only ever seem to hold the R1 and R2 equivalent Seq objects (so 
two sequences' worth of info).
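
Since the logged loop variables only ever show the current pair, they 
cannot reveal what else is being held in memory. A minimal sketch of one 
way to check the process itself, assuming a Linux machine (where 
ru_maxrss is reported in kilobytes); the helper name peak_memory_mb and 
the list used as a stand-in workload are hypothetical, not part of the 
script:

```python
import resource


def peak_memory_mb():
    """Return this process's peak resident set size in MB.

    Assumes Linux, where ru_maxrss is given in kilobytes.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_maxrss / 1024.0


before = peak_memory_mb()
# Hypothetical stand-in for the suspect step (e.g. building the zip).
workload = list(range(10 ** 6))
after = peak_memory_mb()

print("peak memory grew by %.1f MB" % (after - before))
```

Logging this value just before and just after the zip line, and again 
inside the loop, should show which step makes the memory jump.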

So my question is: how do I find out what is going on? What have I 
misunderstood? What is the best way for me to iterate over the index, 
given that I have two indices (R1 and R2) and want to analyse the reads 
as a pair? I suspect it is the .values() call where I am going wrong.
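
For context, on Python 2.7 the built-in zip() returns a complete list 
rather than an iterator, and .values() on a plain dict also builds a 
list; whether the index object behaves the same way is an assumption 
here. A minimal sketch of a lazy alternative, assuming both indices are 
dictionary-like and share identical read IDs as keys (the function name 
iter_read_pairs and the plain dicts standing in for the indices are 
hypothetical):

```python
def iter_read_pairs(r1_index, r2_index):
    """Yield (r1, r2) one pair at a time.

    Each record is looked up by key only when the pair is requested,
    so no list of all values is ever built.
    Assumes both indices use the same read IDs as keys.
    """
    for read_id in r1_index:
        yield r1_index[read_id], r2_index[read_id]


# Plain dicts standing in for the two index objects.
r1_reads = {"read1": "ACGT", "read2": "TTGA"}
r2_reads = {"read1": "CCAT", "read2": "GGTA"}

for r1, r2 in iter_read_pairs(r1_reads, r2_reads):
    pair_seq_dict = {'r1': r1, 'r2': r2}
    print(pair_seq_dict)
```

If the R1 and R2 read IDs differ (for example by a /1 and /2 suffix), 
the key for the second lookup would need to be derived from the first.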

I would really appreciate any comments or help.

Kind regards

Liam Thompson
Mölndal