[Biopython] entire sequence file is unintentionally being loaded

Wed Nov 9 20:11:04 UTC 2016

Hi Liam,

Your description is a little difficult for me to follow. Explaining
with an example would help a lot.

A note of warning though, on Python 2 using zip will make
a list. i.e. This will load everything into memory, even if the
returns from .values() were memory-cautious iterators:

zip(self.R1.values(), self.R2.values())

You can explicitly use "from itertools import izip" instead
to get Python 3 like behaviour.

Assuming your two files are nicely matched, if at all possible
just to iterate over them both together, making one pass over
each file in total (no indexing, no random access/seeking) e.g.

from itertools import izip
# Open the output files, one for each barcode
output_handles = dict()
for barcode in known_barcodes:
    output_handle[barcode] = open(barcode + "_out.fastq", "w")
# Now the single loop over both FASTQ input files
iter1 = SeqIO.parse(file1, "fastq")
iter2 = SeqIO.parse(file2, "fastq")
for r1, r2 in izip(iter1, iter2):
    assert r1.id == r2.id  # modify if /1 and /2 style
    # do stuff with r1 and r2,
    # e.g. output pair to a barcode specific file
    barcode = .... # fill this in using read sequence
    SeqIO.write([r1, r2], output_handle[barcode], "fastq")
# Finally, close handles
for barcode in known_barcodes:
    output_handle[barcode].close()

(Or similar using the low-level string FASTQ parser).

If you have two indexed files though, and a set of record
identifiers to look up, something like this might be what
you are looking for:

index1 = SeqIO.index(file1, "fastq")
index2 = SeqIO.index(file1, "fastq")
for key in my_list_of_keys:
   r1 = index1[key]  # for old style names use key + "/1"
   r2 = index2[key]  # for old style names use key + "/2"
   # do stuff with r1 and r2

Peter

On Wed, Nov 9, 2016 at 6:42 PM, Liam Thompson <dejmail at gmail.com> wrote:
> Hi everyone
>
> I have written a demultiplexing script for an Illumina NGS library, where I
> analyse each pair sequence, find the barcode-primer from a dictionary, and
> assign the reads to a sample file. I'm using python2.7 for compatibility
> reasons on a Linux machine, and the most recent biopython.
>
> Obviously, I don't want to load the entire sequence file into memory which
> is what I have tried to avoid by indexing the reads with biopy first which
> Peter helped with on a previous email.
>
> So I take the index dictionary like object I receive from the index function
> and merge the values with zip so that I have the paired reads information in
> one tuple.
>
>     for r1, r2 in zip(self.R1.values(), self.R2.values()):
>         pair_seq_dict = {'r1' : r1, 'r2' : r2}
>
> I thought fetching the R1 and R2 values like this would essentially
> continuously query the index until the index has run out of values to
> return. I've obviously missed something or am implementing it wrong.
>
> I have checked the output log where I log the output of the values in the
> code, and the entire file is not read into memory. Or at least that is what
> displaying the variables contents says. They only ever seem to have just the
> R1 and R2 equivalent Seq objects (so two sequences worth of info).
>
> So how my question is how do I find out what is going on? What have I
> misunderstood? What is the best way for me to iterate over the index given
> that I have two indices (R1 and R2) and analyse the reads as a pair. I
> suspect the it is the .values() command where I am going wrong.
>
>
> I really appreciate any comments or help
>
> Kind regards
>
> Liam Thompson
> Mölndal
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython