[Biopython-dev] SeqIO and qual: Question about reading and writing qual files

Tue Mar 24 05:49:33 EDT 2009

On Tue, Mar 24, 2009 at 6:24 AM, Sebastian Bassi
<sbassi at clubdelarazon.org> wrote:
> I have a .fasta file and its corresponding .qual file.
> I run seqclean on the fasta file and I got a shorter .fasta file as
> output (that is expected).

Whose seqclean script are you using?  If it doesn't output the trimmed
qual file, can it work with FASTQ output instead?

> Using the .cln file from seqclean, I want to "trim" the .qual file the
> same way my new fasta is trimmed.
> I can read the cln and parse the information of "where to trim".
> For example, in one original sequence of 1000 bp, I may need to trim
> from 150 to 800.
> The problem is that I can't modify qual values using the new SeqIO
> qual parser (at least the size of the list can't be modified). I read
> the example in the doc, where it is cut doing something like:
> sub_rec = fullrec[150:800]
> But, this works only when there is a sequence (so, when read it as
> "fastq"), but it doesn't work when the sequence is read as "qual"
> (because there is no sequence ...
> So my question is: Does it make sense to allow the user to modify the
> size of the list in letter_annotations['phred_quality'] in qual
> sequences? I think this is a nice feature for qual SeqIO.parse.

This was one area of the new SeqRecord slicing I was a little unsure
about - slicing a qual file's SeqRecord (or any SeqRecord with a None
for the sequence).  I hadn't done anything about it immediately as I
couldn't think of a use case for it - so that's solved ;)

One solution would be to introduce an UnknownSeq object, which
would be much nicer to deal with than a None object, as it would have
a length and support slicing.  I've mentioned this idea before, but
haven't yet put forward any actual code.  This seems most elegant.

Another option would be to special case handle slicing a SeqRecord
with a None sequence, where we'd slice its per-letter-annotation. For
now, you can force this with the current code by:

#Not recommend, short term hack
s.letter_annotations._length = 6
s.letter_annotations['phred_quality'] = [0,0,0,0,10,1]

Right now, without changing Biopython, I have another workaround for
you: Use the paired reader in Bio.SeqIO.QualityIO on the untrimmed
FASTA and QUAL files, which will give you SeqRecords with both the
sequence and the quality - and trim these by slicing the SeqRecord.

Peter