[Biopython] slice a record in two and writing both records
Dilara Ally
dilara.ally at gmail.com
Mon Jul 23 17:48:30 EDT 2012
Thanks. Itertools is a fantastic module!
Dilara
On Jul 20, 2012, at 3:07 AM, Peter Cock wrote:
> On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
>> I have a function (modify_record) that slices a SeqRecord into sub-records
>> and returns a sliced record only if it exceeds a certain length (e.g. the
>> sliced record needs to be greater than 40bp). Sometimes slicing the
>> original record yields two different records that are both greater than
>> 40bp. I want to keep both sliced reads and write them as separate records
>> into a single FASTQ file. Here is my code:
>>
>> def modify_record(frec, win, len_threshold):
>>     quality_scores = array(frec.letter_annotations["phred_quality"])
>>     all_window_qc = slidingWindow(quality_scores, win, 1)
>>     track_qc = windowQ(all_window_qc)
>>     myzeros = boolean_array(track_qc, q_threshold, win)
>>     Nrec = slice_points(myzeros, win)[0][1] - 1
>>     where_to_slice = slice_points(myzeros, win)[1]
>>     where_to_slice.append(len(frec) + win)
>>     sub_record = slice_record(frec, Nrec, where_to_slice, win, len_threshold)
>>     return sub_record
>> ...
>
> The key point is that for each input record you may want to
> produce several output records. A single function turning
> one input SeqRecord into one output SeqRecord won't work.
> I would suggest either,
>
> 1. Modify your function to return a list of SeqRecord objects,
> which could be zero, one (as now), or several - depending on
> the slice points. Then use itertools.chain to combine them,
> something like this:
>
> from itertools import chain
> good_reads = chain.from_iterable(modify_record(r, win, len_threshold)
>                                  for r in SeqIO.parse(...))
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
>
> Note it must be chain.from_iterable (not plain chain), since each
> call to modify_record returns a list and we want to flatten those
> lists into one stream of records.
>
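To make the flattening pattern concrete, here is a minimal self-contained sketch of approach 1. Plain strings stand in for SeqRecord objects, and this modify_record (splitting on "N") is a hypothetical stand-in for the real quality-based slicing:

```python
from itertools import chain

def modify_record(read, len_threshold=3):
    # Hypothetical stand-in: split on "N" and keep only the fragments
    # at or above the length threshold. Returns zero, one, or several
    # fragments per input read, just like the real function should.
    return [frag for frag in read.split("N") if len(frag) >= len_threshold]

reads = ["ACGTNACGTACGT", "ACNGT", "TTTTT"]

# chain.from_iterable flattens the per-read lists into one stream of
# fragments; plain chain() would yield the lists themselves.
good_reads = chain.from_iterable(modify_record(r) for r in reads)
fragments = list(good_reads)
print(fragments)  # ['ACGT', 'ACGTACGT', 'TTTTT']
```

With real SeqRecord objects the fragments variable would be passed straight to SeqIO.write instead of list().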
> 2. Use a generator function to process the SeqRecord objects,
>
> def select_fragments(records, win, len_threshold):
>     for record in records:
>         where_to_slice = ...
>         for start, end in where_to_slice:
>             yield record[start:end]
>
> good_reads = select_fragments(SeqIO.parse(...), win, len_threshold)
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
>
> (Slicing with a start and end, record[start:end], gives a new
> SeqRecord; record[slice_point] alone would give a single letter.)
>
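A runnable sketch of approach 2, again with plain strings standing in for SeqRecords; find_slice_points is a hypothetical helper that returns (start, end) pairs for the fragments worth keeping (here, maximal runs without "N"):

```python
def find_slice_points(read, len_threshold):
    # Hypothetical helper: return (start, end) pairs for runs of
    # non-"N" letters at or above the length threshold.
    points = []
    start = None
    for i, base in enumerate(read + "N"):  # sentinel "N" closes the last run
        if base != "N":
            if start is None:
                start = i
        else:
            if start is not None and i - start >= len_threshold:
                points.append((start, i))
            start = None
    return points

def select_fragments(records, len_threshold):
    # Generator: zero, one, or several fragments per input record.
    for record in records:
        for start, end in find_slice_points(record, len_threshold):
            # On a real SeqRecord this slice keeps the per-letter
            # annotations (e.g. quality scores) in step with the sequence.
            yield record[start:end]

fragments = list(select_fragments(["ACGTNACGTACGT", "ACNGT"], 3))
print(fragments)  # ['ACGT', 'ACGTACGT']
```

Because select_fragments is a generator, SeqIO.write can consume it directly without holding all fragments in memory.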
> Both these approaches are generator/iteration based and will
> be memory efficient.
>
> Note you may also want to alter the record identifiers so that
> different fragments from a single read get different IDs.
>
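One hedged way to give the fragments distinct identifiers is to suffix the parent read's ID with a fragment number; Fragment here is a minimal stand-in for a SeqRecord, and the "_part" suffix is just an illustrative convention:

```python
class Fragment:
    # Minimal stand-in for a SeqRecord, holding just an id and a sequence.
    def __init__(self, id, seq):
        self.id = id
        self.seq = seq

def rename_fragments(parent_id, seqs):
    # Append _part1, _part2, ... so fragments cut from one read
    # remain distinguishable in the output FASTQ file.
    return [Fragment("%s_part%i" % (parent_id, n), s)
            for n, s in enumerate(seqs, start=1)]

frags = rename_fragments("read_0001", ["ACGT", "ACGTACGT"])
print([f.id for f in frags])  # ['read_0001_part1', 'read_0001_part2']
```

With real SeqRecord objects you would instead assign record.id (and usually clear record.description) on each slice before writing.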
> Peter