[Biopython] slice a record in two and writing both records

Dilara Ally dilara.ally at gmail.com
Mon Jul 23 21:48:30 UTC 2012


Thanks.  itertools is a fantastic module!
Dilara

On Jul 20, 2012, at 3:07 AM, Peter Cock wrote:

> On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
>> If I have a function (modify_record) that slices a SeqRecord into sub
>> records and returns a sliced record if it exceeds a certain length
>> (e.g. the sliced record must be greater than 40bp), sometimes the
>> original record, when sliced, will yield two different records, both
>> greater than 40bp.  I want to keep both sliced reads and write them as
>> separate records into a single fastq file.  Here is my code:
>> 
>> def modify_record(frec, win, len_threshold):
>>     quality_scores = array(frec.letter_annotations["phred_quality"])
>>     all_window_qc = slidingWindow(quality_scores, win, 1)
>>     track_qc = windowQ(all_window_qc)
>>     myzeros = boolean_array(track_qc, q_threshold, win)
>>     Nrec = slice_points(myzeros, win)[0][1] - 1
>>     where_to_slice = slice_points(myzeros, win)[1]
>>     where_to_slice.append(len(frec) + win)
>>     sub_record = slice_record(frec, Nrec, where_to_slice, win, len_threshold)
>>     return sub_record
>> ...
> 
> The key point is that for each input record you may want to
> produce several output records. A single function turning
> one input SeqRecord into one output SeqRecord won't work.
> I would suggest either,
> 
> 1. Modify your function to return a list of SeqRecord objects,
> which could be zero, one (as now), or several - depending on
> the slice points. Then use itertools.chain to combine them,
> something like this:
> 
> from itertools import chain
> good_reads = chain.from_iterable(modify_record(r) for r in SeqIO.parse(...))
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
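A minimal, self-contained sketch of option 1, without Biopython: the `modify_record` here is a hypothetical stand-in that splits a read string at "N" bases and keeps fragments over a length threshold, just to show how `chain.from_iterable` flattens the per-read lists (zero, one, or several fragments each) into one stream.

```python
from itertools import chain

def modify_record(read, len_threshold=3):
    # Hypothetical stand-in for the real function: split the read
    # at "N" bases and keep fragments longer than the threshold.
    return [frag for frag in read.split("N") if len(frag) > len_threshold]

reads = ["ACGTNACGTACGT", "NNAC", "ACGTACGT"]
# chain.from_iterable flattens the list returned for each read
# into a single iterator of fragments.
good_reads = chain.from_iterable(modify_record(r) for r in reads)
print(list(good_reads))  # ['ACGT', 'ACGTACGT', 'ACGTACGT']
```

Note that plain `chain(...)` would yield the lists themselves rather than the records inside them, which is why `chain.from_iterable` is the right tool here.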
> 
> 2. Use a generator function to process the SeqRecord objects,
> 
> def select_fragments(records, win, len_threshold):
>     for record in records:
>         where_to_slice = ...
>         for start, end in zip(where_to_slice, where_to_slice[1:]):
>             yield record[start:end]
> 
> good_reads = select_fragments(SeqIO.parse(...), win, len_threshold)
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
> 
> Both these approaches are generator/iteration based and will
> be memory efficient.
> 
> Note you may also want to alter the record identifiers so that
> different fragments from a single read get different IDs.
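A self-contained sketch of that ID point, using plain (id, sequence) tuples as hypothetical stand-ins for SeqRecord objects (with real SeqRecords you would assign `sub_record.id` the same way before writing):

```python
def select_fragments(records, len_threshold=3):
    # records: iterable of (id, sequence) pairs standing in for
    # SeqRecord objects, so the sketch runs without Biopython.
    for read_id, seq in records:
        fragments = [f for f in seq.split("N") if len(f) > len_threshold]
        for n, frag in enumerate(fragments, 1):
            # Suffix the parent ID so sibling fragments get distinct IDs.
            yield ("%s.%i" % (read_id, n), frag)

records = [("read1", "ACGTNACGTACGT"), ("read2", "ACGTACGT")]
print(list(select_fragments(records)))
# [('read1.1', 'ACGT'), ('read1.2', 'ACGTACGT'), ('read2.1', 'ACGTACGT')]
```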
> 
> Peter



