[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Mon Feb 23 08:50:49 EST 2009

On Sat, Feb 21, 2009 at 7:50 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Feb 20, 2009 at 11:19 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> Hi all;
>> Good points on this debate so far. What do you all think about a
>> hybrid approach where the .quality attribute is a dictionary? The
>> keys would be the quality type ("phred", "solexa"...) and the values
>> would be a list or string the same length as the sequence.
>
> I was actually thinking about adding a per_letter_annotations (or

I suggest you to use github or any distribuited source versioning
system to test the changes you are describing in this discussion.

For example, I have created a branch on my github repository called
'qualityscores-experimental'
(http://github.com/dalloliogm/biopython/tree/qualityscores-experimental)
with a sample commit where I add a per_symbol_annotations attribute to
SeqRecord:
- http://github.com/dalloliogm/biopython/commit/7821d5f8cab1a5d7c4098c4b52f773b08a45969a

I think that it is easier to discuss over this if you can show how the
code would look like instead of only describing it.

> using Brad's suggested name per_symbol_annotations) dictionary which
> could hold phred qualities, solexa qualities, secondary structure,
> atomic coordinates - any python sequence (e.g. string, list or tuple)
> with a length matching the sequence.  This would cover all the use
> cases I have come up with, and we can implement SeqRecord slicing
> which would also slice everything in the per_letter_annotations
> dictionary.
>
> Note that the per_letter_annotations dictionary could actually be a
> simple subclass of the python dictionary that only allows you to add
> elements with the appropriate length - this would prevent simple
> abuses/accidental errors.
>
>> For slicing, all of the quality dictionary values would be sliced
>> identically to the sequence itself. For BioSQL storage the quality
>> items would go in as annotations with names as a concatenation
>> of the attribute and type ("quality_phred").
>>
>> Treating these specially on the BioSQL in/out is a little hack-y,
>> but quality is likely important enough to not bury it.
>
> If you are trying to store a sequence-with-quality in BioSQL, then yes
> using the existing annotation tables could work - the ontology term
> can tell us its a per-letter-annotation rather than a generic
> annotation.  The only catch is the current tables only let us store
> strings.  We could store each per-letter-annotation entry (e.g. a
> single quality score) as a separate table entry (where the rank tells
> us the correct order), but bundling them all into a single long table
> row might be more efficient.  In the case of PHRED or Solexa scores,
> we could even use the FASTQ encoding (but a string "10, 20, 50, ..."
> might be more sensible).  This would require some co-ordination with
> the other Bio* projects, probably on the BioSQL mailing list.
>
> On the other hand, I don't expect anyone to try and store GB of
> sequence+quality data in BioSQL.  For this a custom database design
> would be much more efficient (or at least some custom tables).  Here
> as Iddo points out, the SeqRecord object may be overkill.
>
>> For Leighton's idea of generalization you could either:
>>
>> - Derive a heavy-weight SeqRecord class from the base class that
>>  added a several additional per-symbol cases.
>>
>> - Provide a generic per_symbol_annotations attribute that collected
>>  these as a dictionary of dictionaries:
>>
>>  dict(quality = dict(phred = [20, 30]),
>>       hydrophobicity = dict(some_predictor = ['some', 'scores'])
>>      )
>>
>> These could map to generic attributes in the same way and follow the
>> same slicing rules. After writing this up, I think the second idea
>> is better and probably exactly what Leighton was proposing.
>
> I'm not sure if its exactly what Leighton has in mind, but it seems
> more complicated to have to do
> my_record.per_symbol_annotations["quality"]["phred"] rather than just
> my_record.per_symbol_annotations["quality_phred"].  I don't see much
> benefit to the extra level of nesting - after all you'll typically
> only have one type of quality present.
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it