[Biopython] index_db or two separate indices

Wed Oct 19 11:03:28 UTC 2016

Hi everyone

I'm attempting to demultiplex paired-end (PE) reads from an Illumina run
(2 files x 3.5 GB).

I thought of creating a single index_db containing both the R1 and R2
reads, as I need to pull out each R1/R2 pair, identify the primer+barcode
sequence in the read sequence, and put the sequence in its designated file.

My problem is that reading both files into index_db runs into duplicate
keys: the ID does not seem to include the 1 or 2 read designation found
in the header (perhaps it is not strictly part of the ID), and as the
callback function is only passed the ID, I can't access the other fields
one would normally be able to with a SeqRecord.

index_list = SeqIO.index_db(idx_name,
                            ["sorted_5000_R1.fq", "sorted_5000_R2.fq"],
                            "fastq", generic_dna, get_record)
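
To make the key collision concrete, here is a minimal plain-Python sketch. It assumes Casava 1.8+ style headers, where the read number sits after a space in the title line. The two header strings and the helper name split_header are made up for illustration and are not taken from my data or from Biopython. As far as I understand, the FASTQ parser uses the text up to the first whitespace as the record ID, so both mates end up with the same key.

```python
def split_header(title_line):
    """Split a FASTQ title line into (record_id, description).

    The ID is taken as everything up to the first whitespace, which is
    the assumption this sketch makes about how the key is derived.
    """
    text = title_line.rstrip()
    if text.startswith("@"):
        text = text[1:]
    parts = text.split(None, 1)
    record_id = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return record_id, description


# Hypothetical mate headers (not real data).
r1_header = "@M01234:56:000000000-ABCDE:1:1101:15589:1332 1:N:0:1"
r2_header = "@M01234:56:000000000-ABCDE:1:1101:15589:1332 2:N:0:1"

r1_id, r1_desc = split_header(r1_header)
r2_id, r2_desc = split_header(r2_header)

# Both mates share one ID; only the description carries the read number.
print(r1_id == r2_id)
print(r1_desc.split(":")[0], r2_desc.split(":")[0])
```

If that assumption holds, a key function that only ever sees the ID has nothing to tell R1 from R2 with.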

Is it best then to just create two separate indices using SeqIO.index
and pull out the sequences from there? I would prefer not to have to
load both indices into memory, though perhaps they are not as big as I
think they might be.
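
For illustration, this is roughly what I mean by two separate indices, written as a plain-Python stand-in rather than with Biopython itself. It is only a sketch: the helpers build_offset_index and fetch and the demo file names are hypothetical, and it assumes strict four-line FASTQ records. Each index holds just a record ID and a byte offset per read, not the reads themselves, which as I understand it is similar in spirit to what SeqIO.index keeps in memory.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each record ID to the byte offset of its title line.

    Assumes strict four-line FASTQ records (no wrapped sequence lines).
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            start = handle.tell()
            title = handle.readline()
            if not title:
                break  # end of file
            if not title.strip():
                continue  # tolerate stray blank lines
            record_id = title[1:].split(None, 1)[0].decode("ascii")
            offsets[record_id] = start
            for _ in range(3):  # skip sequence, '+' and quality lines
                handle.readline()
    return offsets


def fetch(path, offset):
    """Return the four lines of the record starting at the given offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return [handle.readline().decode("ascii").rstrip("\r\n")
                for _ in range(4)]


# Tiny made-up demo files standing in for the real R1/R2 files.
tmpdir = tempfile.mkdtemp()
r1_path = os.path.join(tmpdir, "demo_R1.fq")
r2_path = os.path.join(tmpdir, "demo_R2.fq")
with open(r1_path, "w", newline="\n") as out:
    out.write("@read1 1:N:0:1\nACGT\n+\nIIII\n@read2 1:N:0:1\nTTTT\n+\nIIII\n")
with open(r2_path, "w", newline="\n") as out:
    out.write("@read1 2:N:0:1\nGGCC\n+\nIIII\n@read2 2:N:0:1\nAAAA\n+\nIIII\n")

r1_index = build_offset_index(r1_path)
r2_index = build_offset_index(r2_path)

# Pair the mates through the ID they share, one index per file.
pairs = {}
for record_id, r1_offset in r1_index.items():
    if record_id in r2_index:
        r1_seq = fetch(r1_path, r1_offset)[1]
        r2_seq = fetch(r2_path, r2_index[record_id])[1]
        pairs[record_id] = (r1_seq, r2_seq)

print(pairs)
```

With one index per file the shared ID is no longer a problem, since it is exactly what ties the two mates together.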

Any suggestions?

Thanks
Liam

Gothenburg, Sweden