[Biopython] slice a record in two and writing both records

Thu Jul 19 17:21:42 UTC 2012

Hi Dilara,

For your first question, it seems that the `modify_record` function
always returns only one SeqRecord object. This is a bit of a guesswork
from my end as I don't know how most of the functions in `modify_record` work,
but since you still see an ouput sequence at the end, I
think you may want to re-check again how `sub_record` returns its
values / how it returns more than one SeqRecord objects.

Also, you might want to try changing the last two lines:

    newly_filtered.append(s)
    SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")

Here, you're doing `SeqIO.write` for each iteration of the loop.
Although the end result is the same (a file containing all the sequence
you want), the code may be made more efficient by putting the
SeqIO.write line outside of the loop, after all sequences are pooled
in the `newly_filtered` list.

For your second question, I personally find generator expressions to
be more compact and easier to read. This is important for future code
maintenance ~ having more readable lines of code means it's easier to
understand your code and to debug them in case something goes wrong.
Note that generator expressions aren't silver bullets. In some cases,
for loops may still be better (e.g. if you're doing complex operations
on the objects your iterating over).

I find these two sites helpful when I first grappled with generators
and generator expressions. I hope they are the same to you too:

* http://stackoverflow.com/questions/1995418/python-generator-expression-vs-yield
* http://www.dabeaz.com/generators/Generators.pdf (PDF)

Hope that helps :),
Bow

On Thu, Jul 19, 2012 at 5:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> If I have a function (modify_record) that slices up a SeqRecord into sub
> records and then returns the sliced record if it has a certain length (for
> e.g. the sliced record needs to be greater than 40bp), sometimes the
> original record when sliced will have two different records both greater
> than 40bp.  I want to keep both sliced reads and rewrite them as separate
> records into a single fastq file.  Here is my code:
>
> def modify_record(frec, win, len_threshold):
>     quality_scores = array(frec.letter_annotations["phred_quality"])
>     all_window_qc = slidingWindow(quality_scores, win,1)
>     track_qc = windowQ(all_window_qc)
>     myzeros = boolean_array(track_qc, q_threshold,win)
>     Nrec = slice_points(myzeros,win)[0][1]-1
>     where_to_slice = slice_points(myzeros,win)[1]
>     where_to_slice.append(len(frec)+win)
>     sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
>     return sub_record
>
> q_threshold = 20
> win = 5
> len_threshold = 30
>
> from Bio import SeqIO
> from numpy import *
> good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if
> array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold)
> count = SeqIO.write(good_reads, "temp.fastq", "fastq")
> print "Saved %i reads" % count
>
> newly_filtered=[]
> for rec in SeqIO.parse("temp.fastq", "fastq"):
>     s = modify_record(rec, win, len_threshold)
>     newly_filtered.append(s)
>     SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")
>
> This writes only the first sub_record even when there are more than 1 that
> have a len >40bp. I've tried this as a generator expression and I'm still
> getting just the first sub_record.   I'd also prefer to not to use append
> as it was previously suggested that this can lead to problems if you run
> the script more than once.  Instead, I want to employ a generator
> expression - but I'm still getting used to the idea of generator
> expressions.
>
> My second question is more general.  Generator expressions are more memory
> efficient than a list comprehension, but how are they better than just a
> simple loop that pulls in a single record, does something and then writes
> that record? Is it just a time issue?
>
> Many thanks for the help!
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython