[Biopython-dev] SeqIO and qual: Question about reading and writing qual files

Tue Mar 31 13:12:37 EDT 2009

On Thu, Mar 26, 2009 at 12:30 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Mar 25, 2009 at 11:15 PM, Sebastian Bassi:
>>> Sebastian - could you have a quick play with this github code (using the new
>>> UnknownSeq class), and the current CVS code (using None), and make sure
>>> both support the slicing operations you were trying earlier?  Thanks.
>>
>> ...
>>
>> From a practical point of view, both versions are the same, but the
>> concept of UnknownSeq looks more solid than None, because if I didn't
>> know about Biopython internals, I would never try to slice a None
>> seq. With "None" ...
>> But with the UnknownSeq object, len(s) returns an actual length, so it
>> is more intuitive that it can be sliced.
>
> I agree the UnknownSeq is more intuitive - plus it makes the SeqRecord
> __getitem__ code nicer, and it means you can do len(SeqRecord) too,
> which was problematic if the sequence was None.

I've checked the UnknownSeq change into CVS following this discussion
(and a little off-list). I wasn't comfortable using None for a sequence,
and trying to support len(...) and slicing on such SeqRecord objects at
the same time was basically horrible.
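
To illustrate why a length-aware placeholder is easier to work with than
None, here is a minimal sketch. PlaceholderSeq is a hypothetical stand-in
written for this example only; it is not the UnknownSeq code that went
into CVS, and the names and behaviour are assumptions.

```python
class PlaceholderSeq:
    """Hypothetical sequence of known length but unknown letters."""

    def __init__(self, length, character="?"):
        if length < 0:
            raise ValueError("length must be non-negative")
        if len(character) != 1:
            raise ValueError("character must be a single letter")
        self._length = length
        self._character = character

    def __len__(self):
        return self._length

    def __str__(self):
        return self._character * self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            # range() already implements slice semantics, so only the
            # length of the sliced range is needed.
            new_length = len(range(self._length)[index])
            return PlaceholderSeq(new_length, self._character)
        # Single position: let range() validate the index (it raises
        # IndexError when out of bounds), then return the placeholder.
        range(self._length)[index]
        return self._character


s = PlaceholderSeq(10)
sub = s[2:6]
print(len(s), len(sub), str(sub))
```

With None in place of the sequence, both len(None) and None[2:6] raise
TypeError, so the record code would need a special case for each.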

>> Then I tried the git code and it also worked. One thing I noticed is
>> that I got "?" instead of "N" as the "sequence" of the UnknownSeq.
>
> I felt we shouldn't use an "N" unless we are confident the sequence
> is nucleotides.  In practice, this is probably a safe assumption for
> FASTQ and QUAL files - unless anyone can think of a counter example?
> Do you think it is safe to assume FASTQ and QUAL files are just for
> nucleotides?
>
> I mean, you could translate a CDS from transcriptome sequencing,
> and for the sake of argument give each amino acid a quality score
> from the three nucleotide quality scores, and then save this as a protein
> FASTQ file.  But I've never heard of anyone actually doing this ;)

So, should we assume QUAL files (and perhaps FASTQ files) are
nucleotides when reading them in, and enforce this when writing
them out?  This would mean the QUAL files' UnknownSeq objects
would use the letter "N" instead of "?".

Or is it more generic to leave it as it is, and not make or force any
assumptions about the nature of the sequence?
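
To make the two options concrete, here is a rough sketch of the choice.
The function names and the assume_nucleotide flag are hypothetical,
invented for this illustration; nothing like them exists in the parser.

```python
def placeholder_letter(assume_nucleotide):
    """Return "N" if we assume nucleotides, otherwise the generic "?"."""
    return "N" if assume_nucleotide else "?"


def placeholder_sequence(qualities, assume_nucleotide=False):
    """Build a placeholder string with one letter per quality score."""
    return placeholder_letter(assume_nucleotide) * len(qualities)


scores = [30, 40, 20]
generic = placeholder_sequence(scores)
nucleotide = placeholder_sequence(scores, assume_nucleotide=True)
print(generic, nucleotide)
```

Either way the length comes from the number of quality scores; the only
question is which letter stands in for the unknown residues.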

Peter