[Biopython-dev] about the SeqRecord and SeqFeature classes

Fri Sep 26 13:43:25 UTC 2008

Hi Jose,

>> I see you have created a subclass [of] the SeqRecord to add a quality
>> property, and made sure this gets sliced too in the __getitem__.  This
>> is a nice approach (and demonstrates how people could extend the basic
>> Biopython objects in their own code).  I would also suggest in the
>> __init__ method checking that the quality sequence is the same length
>> as the sequence itself.
>
> To do that properly I would like to use a property, which is why I was
> asking about the possibility of turning SeqRecord and Seq into new style
> classes.

Oh I see - then you could put the length check in the property set method?
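
As a rough sketch of that idea: the class below is a hypothetical stand-in,
not the real SeqRecord, and the names QualityRecord and _quality are
assumptions made for illustration. It only shows a length check living in a
property setter, which needs a new style class (one deriving from object).

```python
class QualityRecord(object):
    """Hypothetical stand-in for a SeqRecord subclass; not Biopython code."""

    def __init__(self, seq, quality):
        self.seq = seq
        # This assignment goes through the property setter below,
        # so the length check also runs at construction time.
        self.quality = quality

    @property
    def quality(self):
        return self._quality

    @quality.setter
    def quality(self, values):
        # Reject quality scores that do not match the sequence length.
        if len(values) != len(self.seq):
            raise ValueError("quality has %i values, sequence has %i letters"
                             % (len(values), len(self.seq)))
        self._quality = list(values)


record = QualityRecord("ACGT", [30, 31, 32, 33])
```

With this in place, assigning record.quality = [1, 2] raises a ValueError
rather than silently leaving the record inconsistent.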

Would you like to file an enhancement bug for transforming SeqRecord
and Seq into new style classes, and prepare a patch (for this only)?
If this doesn't cause any problems with the unit tests then I don't
foresee any problems getting that change made.

>> If we were to add something like this to Biopython directly, I prefer
>> "quality" over "qual" (just three letters longer but much clearer).
>
> That's not a problem. I used qual to do it similar to .seq

Style is often debatable.  Sequence is quite long, and seq is fairly
clear.  Qual on the other hand could be short for qualifier (a term
used in feature annotation).

>> I would also consider adding the quality to the Seq object (subclassing
>> the Seq object rather than the SeqRecord object).  My reasoning is
>> that for 454 or Solexa sequencing, you will have thousands of reads
>> and all you really care about is the nucleotide sequence and the
>> quality scores.  Unless you want to give them all unique names, there is
>> little point having the overhead of the various annotation properties
>> of the SeqRecord.
>
> I didn't subclass Seq because if we want a quality without name we could just
> use a tuple or a list. My idea was to create a class with two main
> properties, seq and qual (or quality). ...
> I agree that these classes should be prepared to deal with a lot of sequences
> and they should be efficient. But I don't have the experience to foresee
> which model would be better in that regard.

I haven't had to deal with 454 or Solexa sequence data yet (but I am
hoping to in the next six months).  Given there are lots of possible
implementation/object structure ideas, I think it might be premature
to pick one for Biopython right now.  Would you be happy with the
SeqRecord __getitem__ method (Bug 2507) and creating the subclassed
SeqRecord with quality in your own code?  If you find that works well
in real usage, it would be encouraging for us to use it in Biopython.  Or
have you already been using something like this for serious data
analysis?
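
For illustration only, the subclassing approach might look something like
the sketch below. BaseRecord is a hypothetical stand-in for a SeqRecord
with a slicing __getitem__ (the behaviour requested in Bug 2507), and
QualitySeqRecord is an assumed name; neither is actual Biopython code.

```python
class BaseRecord(object):
    """Hypothetical stand-in for a sliceable SeqRecord; not Biopython code."""

    def __init__(self, seq, name=""):
        self.seq = seq
        self.name = name

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing gives a new record that keeps the name.
            return BaseRecord(self.seq[index], self.name)
        return self.seq[index]


class QualitySeqRecord(BaseRecord):
    """Adds per-letter quality scores and slices them with the sequence."""

    def __init__(self, seq, quality, name=""):
        BaseRecord.__init__(self, seq, name)
        if len(quality) != len(seq):
            raise ValueError("quality and sequence lengths differ")
        self.quality = list(quality)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Apply the same slice to both sequence and quality.
            return QualitySeqRecord(self.seq[index], self.quality[index],
                                    self.name)
        # A single position gives a (letter, score) pair.
        return self.seq[index], self.quality[index]


read = QualitySeqRecord("ACGTACGT", [10, 20, 30, 40, 40, 30, 20, 10],
                        name="read1")
trimmed = read[2:6]
```

Here trimmed is a new QualitySeqRecord whose seq and quality were cut with
the same slice, so the two cannot drift out of step.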

Peter