[Biopython] slice a record in two and writing both records

Fri Jul 20 10:07:04 UTC 2012

On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> If I have a function (modify_record) that slices up a SeqRecord into sub
> records and then returns the sliced record if it has a certain length (for
> e.g. the sliced record needs to be greater than 40bp), sometimes the
> original record when sliced will have two different records both greater
> than 40bp.  I want to keep both sliced reads and rewrite them as separate
> records into a single fastq file.  Here is my code:
>
> def modify_record(frec, win, len_threshold):
>     quality_scores = array(frec.letter_annotations["phred_quality"])
>     all_window_qc = slidingWindow(quality_scores, win,1)
>     track_qc = windowQ(all_window_qc)
>     myzeros = boolean_array(track_qc, q_threshold,win)
>     Nrec = slice_points(myzeros,win)[0][1]-1
>     where_to_slice = slice_points(myzeros,win)[1]
>     where_to_slice.append(len(frec)+win)
>     sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
>     return sub_record
> ...

The key point is that for each input record you may want to
produce several output records. A single function turning
one input SeqRecord into one output SeqRecord won't work.
I would suggest either,

1. Modify your function to return a list of SeqRecord objects,
which could be zero, one (as now), or several - depending on
the slice points. Then use itertools.chain to combine them,
something like this:

from itertools import chain
good_reads = chain(modify_record(r) for r in SeqIO.parse(...))
count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
print "Saved %i read fragments" % count

2. Use a generator function to process the SeqRecord objects,

def select_fragments(records, win, len_threshold):
    for record in records:
         where_to_slice = ...
         for slice_point in where_to_slice:
             yield record[slice_point]

good_reads = select_fragments(SeqIO.parse(...))
count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
print "Saved %i read fragments" % count

Both these approaches are generator/iteration based and will
be memory efficient.

Note you may also want to alter the record identifiers so that
different fragments from a single read get different IDs.

Peter