[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?
Peter
biopython at maubp.freeserve.co.uk
Sat Feb 21 18:50:15 UTC 2009
On Fri, Feb 20, 2009 at 11:19 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi all;
> Good points on this debate so far. What do you all think about a
> hybrid approach where the .quality attribute is a dictionary? The
> keys would be the quality type ("phred", "solexa"...) and the values
> would be a list or string the same length as the sequence.
I was actually thinking about adding a per_letter_annotations (or
using Brad's suggested name per_symbol_annotations) dictionary which
could hold phred qualities, solexa qualities, secondary structure,
atomic coordinates - any python sequence (e.g. string, list or tuple)
with a length matching the sequence. This would cover all the use
cases I have come up with, and we can implement SeqRecord slicing
which would also slice everything in the per_letter_annotations
dictionary.
Note that the per_letter_annotations dictionary could actually be a
simple subclass of the python dictionary that only allows you to add
elements with the appropriate length - this would prevent simple
abuses/accidental errors.
> For slicing, all of the quality dictionary values would be sliced
> identically to the sequence itself. For BioSQL storage the quality
> items would go in as annotations with names as a concatenation
> of the attribute and type ("quality_phred").
>
> Treating these specially on the BioSQL in/out is a little hack-y,
> but quality is likely important enough to not bury it.
If you are trying to store a sequence-with-quality in BioSQL, then yes
using the existing annotation tables could work - the ontology term
can tell us its a per-letter-annotation rather than a generic
annotation. The only catch is the current tables only let us store
strings. We could store each per-letter-annotation entry (e.g. a
single quality score) as a separate table entry (where the rank tells
us the correct order), but bundling them all into a single long table
row might be more efficient. In the case of PHRED or Solexa scores,
we could even use the FASTQ encoding (but a string "10, 20, 50, ..."
might be more sensible). This would require some co-ordination with
the other Bio* projects, probably on the BioSQL mailing list.
On the other hand, I don't expect anyone to try and store GB of
sequence+quality data in BioSQL. For this a custom database design
would be much more efficient (or at least some custom tables). Here
as Iddo points out, the SeqRecord object may be overkill.
> For Leighton's idea of generalization you could either:
>
> - Derive a heavy-weight SeqRecord class from the base class that
> added a several additional per-symbol cases.
>
> - Provide a generic per_symbol_annotations attribute that collected
> these as a dictionary of dictionaries:
>
> dict(quality = dict(phred = [20, 30]),
> hydrophobicity = dict(some_predictor = ['some', 'scores'])
> )
>
> These could map to generic attributes in the same way and follow the
> same slicing rules. After writing this up, I think the second idea
> is better and probably exactly what Leighton was proposing.
I'm not sure if its exactly what Leighton has in mind, but it seems
more complicated to have to do
my_record.per_symbol_annotations["quality"]["phred"] rather than just
my_record.per_symbol_annotations["quality_phred"]. I don't see much
benefit to the extra level of nesting - after all you'll typically
only have one type of quality present.
Peter
More information about the Biopython-dev
mailing list