[Biopython] slice a record in two and writing both records
Dilara Ally
dilara.ally at gmail.com
Thu Jul 19 11:51:35 EDT 2012
If I have a function (modify_record) that slices up a SeqRecord into sub
records and then returns the sliced record if it has a certain length (for
e.g. the sliced record needs to be greater than 40bp), sometimes the
original record when sliced will have two different records both greater
than 40bp. I want to keep both sliced reads and rewrite them as separate
records into a single fastq file. Here is my code:
def modify_record(frec, win, len_threshold):
quality_scores = array(frec.letter_annotations["phred_quality"])
all_window_qc = slidingWindow(quality_scores, win,1)
track_qc = windowQ(all_window_qc)
myzeros = boolean_array(track_qc, q_threshold,win)
Nrec = slice_points(myzeros,win)[0][1]-1
where_to_slice = slice_points(myzeros,win)[1]
where_to_slice.append(len(frec)+win)
sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
return sub_record
q_threshold = 20
win = 5
len_threshold = 30
from Bio import SeqIO
from numpy import *
good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if
array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold)
count = SeqIO.write(good_reads, "temp.fastq", "fastq")
print "Saved %i reads" % count
newly_filtered=[]
for rec in SeqIO.parse("temp.fastq", "fastq"):
s = modify_record(rec, win, len_threshold)
newly_filtered.append(s)
SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")
This writes only the first sub_record even when there are more than 1 that
have a len >40bp. I've tried this as a generator expression and I'm still
getting just the first sub_record. I'd also prefer to not to use append
as it was previously suggested that this can lead to problems if you run
the script more than once. Instead, I want to employ a generator
expression - but I'm still getting used to the idea of generator
expressions.
My second question is more general. Generator expressions are more memory
efficient than a list comprehension, but how are they better than just a
simple loop that pulls in a single record, does something and then writes
that record? Is it just a time issue?
Many thanks for the help!
More information about the Biopython
mailing list