[Biopython] slice a record in two and writing both records
Peter Cock
p.j.a.cock at googlemail.com
Fri Jul 20 06:07:04 EDT 2012
On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> If I have a function (modify_record) that slices up a SeqRecord into sub
> records and then returns a sliced record if it has a certain length (e.g.
> the sliced record needs to be greater than 40bp), sometimes the original
> record when sliced will yield two different records both greater than
> 40bp. I want to keep both sliced reads and write them as separate
> records into a single FASTQ file. Here is my code:
>
> def modify_record(frec, win, len_threshold):
>     quality_scores = array(frec.letter_annotations["phred_quality"])
>     all_window_qc = slidingWindow(quality_scores, win, 1)
>     track_qc = windowQ(all_window_qc)
>     myzeros = boolean_array(track_qc, q_threshold, win)
>     Nrec = slice_points(myzeros, win)[0][1] - 1
>     where_to_slice = slice_points(myzeros, win)[1]
>     where_to_slice.append(len(frec) + win)
>     sub_record = slice_record(frec, Nrec, where_to_slice, win, len_threshold)
>     return sub_record
> ...
The key point is that for each input record you may want to
produce several output records. A single function turning
one input SeqRecord into one output SeqRecord won't work.
I would suggest one of the following:

1. Modify your function to return a list of SeqRecord objects,
which could hold zero, one (as now), or several records, depending
on the slice points. Then flatten the lists of records into one
stream with itertools.chain.from_iterable (plain chain would give
you the lists themselves, not the records inside them),
something like this:

from itertools import chain
good_reads = chain.from_iterable(modify_record(r, win, len_threshold)
                                 for r in SeqIO.parse(...))
count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
print "Saved %i read fragments" % count
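To make the flattening concrete, here is a minimal runnable sketch in
which plain strings stand in for SeqRecord objects, and a toy
modify_record splits each read at "N" bases (the splitting rule is
invented purely for illustration; the real function would use the
quality-window logic above):

```python
from itertools import chain

def modify_record(read, len_threshold=3):
    # Toy stand-in: split a read at "N" characters and keep only the
    # fragments of at least len_threshold letters (zero, one, or several).
    return [frag for frag in read.split("N") if len(frag) >= len_threshold]

reads = ["ACGTNACG", "AANCC", "TTTTTTT"]

# chain.from_iterable flattens the per-read lists into one stream
# of fragments, just as SeqIO.write would want to consume them.
good = list(chain.from_iterable(modify_record(r) for r in reads))
print(good)  # → ['ACGT', 'ACG', 'TTTTTTT']
```

Note the middle read contributes nothing (both its fragments are too
short), while the others contribute two fragments and one respectively.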
2. Use a generator function to process the SeqRecord objects,
something like this:

def select_fragments(records, win, len_threshold):
    for record in records:
        where_to_slice = ...
        for start, end in where_to_slice:
            yield record[start:end]

good_reads = select_fragments(SeqIO.parse(...), win, len_threshold)
count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
print "Saved %i read fragments" % count

(Note the start:end slice - indexing a SeqRecord with a single
integer would give you one letter, not a sub-record.)
Both these approaches are generator/iterator based, so the reads
are processed one at a time and memory usage stays low even for
large FASTQ files.
Note you may also want to alter the record identifiers so that
different fragments from a single read get different IDs.
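One way to do the renaming, sketched below: slicing a SeqRecord keeps
the per-letter quality scores, so you only need to reassign the id on
each fragment. The (start, end) pairs and the ".1", ".2" suffix scheme
here are illustrative assumptions; any scheme giving unique IDs would do.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def rename_fragments(record, slice_pairs):
    """Yield sub-records with IDs like 'read1.1', 'read1.2', ..."""
    for i, (start, end) in enumerate(slice_pairs):
        fragment = record[start:end]  # slicing keeps the phred qualities
        fragment.id = "%s.%i" % (record.id, i + 1)
        fragment.description = ""
        yield fragment

record = SeqRecord(Seq("ACGTACGTACGT"), id="read1",
                   letter_annotations={"phred_quality": [30] * 12})
fragments = list(rename_fragments(record, [(0, 5), (7, 12)]))
# fragments[0].id == "read1.1", fragments[1].id == "read1.2"
```

This slots straight into either approach above, since it yields the
fragments one at a time.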
Peter