[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Sat Feb 21 18:50:15 UTC 2009

On Fri, Feb 20, 2009 at 11:19 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi all;
> Good points on this debate so far. What do you all think about a
> hybrid approach where the .quality attribute is a dictionary? The
> keys would be the quality type ("phred", "solexa"...) and the values
> would be a list or string the same length as the sequence.

I was actually thinking about adding a per_letter_annotations (or
using Brad's suggested name per_symbol_annotations) dictionary which
could hold phred qualities, solexa qualities, secondary structure,
atomic coordinates - any Python sequence (e.g. string, list or tuple)
with a length matching the sequence.  This would cover all the use
cases I have come up with, and we can implement SeqRecord slicing
which would also slice everything in the per_letter_annotations
dictionary.
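
A minimal sketch of the slicing behaviour described above. The class
name AnnotatedRecord and the annotation keys are hypothetical stand-ins
for illustration, not the actual SeqRecord code:

```python
class AnnotatedRecord:
    """Hypothetical stand-in for SeqRecord, for illustration only."""

    def __init__(self, seq, per_letter_annotations=None):
        self.seq = seq
        self.per_letter_annotations = dict(per_letter_annotations or {})
        # Every annotation must have one entry per letter of the sequence.
        for key, values in self.per_letter_annotations.items():
            if len(values) != len(seq):
                raise ValueError(
                    "annotation %r has length %d, expected %d"
                    % (key, len(values), len(seq))
                )

    def __getitem__(self, index):
        if not isinstance(index, slice):
            # A single position just returns the letter itself.
            return self.seq[index]
        # Apply the same slice to the sequence and to every annotation.
        sliced = {
            key: values[index]
            for key, values in self.per_letter_annotations.items()
        }
        return AnnotatedRecord(self.seq[index], sliced)


record = AnnotatedRecord(
    "ACGTAC",
    {
        "phred_quality": [40, 38, 30, 22, 15, 10],
        "secondary_structure": "HHHEEC",
    },
)
sub = record[1:4]
print(sub.seq, sub.per_letter_annotations)
```

Because strings, lists and tuples all support slicing, the same
values[index] call works whatever type is used to hold the annotation.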

Note that the per_letter_annotations dictionary could actually be a
simple subclass of the Python dictionary that only allows you to add
elements with the appropriate length - this would prevent simple
abuses/accidental errors.
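
One way such a restricted dictionary might look, as a rough sketch. The
class name LengthCheckedDict is an assumption made up for this example:

```python
class LengthCheckedDict(dict):
    """Hypothetical dict that only accepts values of one fixed length."""

    def __init__(self, length):
        super().__init__()
        self._length = int(length)

    def __setitem__(self, key, value):
        # Require something sequence-like (supports len() and indexing).
        if not hasattr(value, "__len__") or not hasattr(value, "__getitem__"):
            raise TypeError("per-letter annotation must be a sequence")
        if len(value) != self._length:
            raise ValueError(
                "expected length %d, got %d" % (self._length, len(value))
            )
        super().__setitem__(key, value)

    def update(self, *args, **kwargs):
        # dict.update() bypasses __setitem__, so route it through the check.
        for key, value in dict(*args, **kwargs).items():
            self[key] = value


annotations = LengthCheckedDict(4)
annotations["phred_quality"] = [10, 20, 30, 40]  # accepted
try:
    annotations["solexa_quality"] = [1, 2, 3]  # wrong length
except ValueError as err:
    print("rejected:", err)
```

This sketch only guards item assignment and update(); other entry
points such as setdefault() would need the same treatment in a real
implementation.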

> For slicing, all of the quality dictionary values would be sliced
> identically to the sequence itself. For BioSQL storage the quality
> items would go in as annotations with names as a concatenation
> of the attribute and type ("quality_phred").
>
> Treating these specially on the BioSQL in/out is a little hack-y,
> but quality is likely important enough to not bury it.

If you are trying to store a sequence-with-quality in BioSQL, then yes,
using the existing annotation tables could work - the ontology term
can tell us it's a per-letter-annotation rather than a generic
annotation.  The only catch is the current tables only let us store
strings.  We could store each per-letter-annotation entry (e.g. a
single quality score) as a separate table entry (where the rank tells
us the correct order), but bundling them all into a single long table
row might be more efficient.  In the case of PHRED or Solexa scores,
we could even use the FASTQ encoding (but a string "10, 20, 50, ..."
might be more sensible).  This would require some co-ordination with
the other Bio* projects, probably on the BioSQL mailing list.
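
To make the two string encodings concrete, here is a small sketch. The
function names are made up for this example, and the FASTQ variant
assumes the Sanger convention of PHRED score plus an ASCII offset of 33
(Solexa FASTQ files use a different offset and score definition, which
is not handled here):

```python
SANGER_OFFSET = 33  # assumption: Sanger-style FASTQ, PHRED + 33


def scores_to_text(scores):
    """Encode scores as a readable string like '10, 20, 50'."""
    return ", ".join(str(score) for score in scores)


def text_to_scores(text):
    """Decode a string like '10, 20, 50' back into a list of ints."""
    return [int(part) for part in text.split(",")] if text.strip() else []


def scores_to_fastq(scores):
    """Encode each score as one ASCII character (one letter per score)."""
    return "".join(chr(score + SANGER_OFFSET) for score in scores)


def fastq_to_scores(text):
    """Decode a FASTQ-style quality string back into a list of ints."""
    return [ord(char) - SANGER_OFFSET for char in text]


scores = [10, 20, 50]
print(scores_to_text(scores))   # 10, 20, 50
print(scores_to_fastq(scores))  # +5S
```

The FASTQ form is more compact (one character per letter), while the
comma-separated form is easier to read and is not limited to the range
of printable ASCII characters.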

On the other hand, I don't expect anyone to try and store GB of
sequence+quality data in BioSQL.  For this a custom database design
would be much more efficient (or at least some custom tables).  Here,
as Iddo points out, the SeqRecord object may be overkill.

> For Leighton's idea of generalization you could either:
>
> - Derive a heavy-weight SeqRecord class from the base class that
>  added several additional per-symbol cases.
>
> - Provide a generic per_symbol_annotations attribute that collected
>  these as a dictionary of dictionaries:
>
>  dict(quality = dict(phred = [20, 30]),
>       hydrophobicity = dict(some_predictor = ['some', 'scores'])
>      )
>
> These could map to generic attributes in the same way and follow the
> same slicing rules. After writing this up, I think the second idea
> is better and probably exactly what Leighton was proposing.

I'm not sure if it's exactly what Leighton has in mind, but it seems
more complicated to have to do
my_record.per_symbol_annotations["quality"]["phred"] rather than just
my_record.per_symbol_annotations["quality_phred"].  I don't see much
benefit to the extra level of nesting - after all you'll typically
only have one type of quality present.
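
For comparison, the two layouts side by side, using the example values
from Brad's message. The flatten() helper is hypothetical and only
shows that the nested form can be collapsed into "category_type" keys
without losing anything:

```python
# Nested layout: category first, then the specific type.
nested = {
    "quality": {"phred": [20, 30]},
    "hydrophobicity": {"some_predictor": ["some", "scores"]},
}


def flatten(nested_annotations):
    """Collapse {category: {kind: values}} into {'category_kind': values}."""
    return {
        "%s_%s" % (category, kind): values
        for category, kinds in nested_annotations.items()
        for kind, values in kinds.items()
    }


flat = flatten(nested)

# Two lookups with the nested layout, one with the flat layout.
print(nested["quality"]["phred"])
print(flat["quality_phred"])

# Slicing the flat layout needs a single loop over the dictionary.
first_letter_only = {key: values[:1] for key, values in flat.items()}
```

The flat keys also match the concatenated names ("quality_phred")
suggested above for BioSQL storage.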

Peter