[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Mon Feb 23 05:42:13 EST 2009

On Mon, Feb 23, 2009 at 9:48 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> Hi all,
>
> On 22/02/2009 21:27, "Brad Chapman" <chapmanb at 50mail.com> wrote:
>
>> [...]
>>> I'm not sure if its exactly what Leighton has in mind, but it seems
>>> more complicated to have to do
>>> my_record.per_symbol_annotations["quality"]["phred"] rather than just
>>> my_record.per_symbol_annotations["quality_phred"].
>>
>> I'm agreed with you here -- the double dictionary I proposed is ugly
>> and doesn't do much of anything extra. I'm +1 on exactly what you wrote
>> here, and am not picky about the naming.
>
> I was originally suggesting two extremes, a lightweight dictionary and a
> more heavyweight new class.  I now prefer the lightweight option, which I
> imagine might operate along the lines of (keeping away from quality scores,
> for now...)
>
>>>> my_seqrecord
> SeqRecord(seq=Seq('FCLEPPYWYKNPGARTESRILRGGIID', Alphabet()),
> id='my_seqrecord', name='<unknown name>', description='<unknown
> description>', dbxrefs=[])
>>>> my_seqrecord.per_symbol_annotations['secondary_structure']
> 'HHHHHHEEEEEEE     EEEEEEEEE'
>>>> my_seqrecord.per_symbol_annotations['hydrophobicity']
> [0.823, 0.880, 0.987, 0.461, 0.706, 0.972, 0.109, 0.499, 0.908, 0.045,
> 0.493, 0.162, 0.796, 0.989, 0.419, 0.501, 0.686, 0.985, 0.502, 0.242, 0.890,
> 0.436, 0.855, 0.426, 0.814, 0.178, 0.923]
>>>> # Assuming that one day there's slicing of SeqRecords...
>>>> shorter_seqrecord = my_seqrecord[:10]
>>>> shorter_seqrecord.per_symbol_annotations['secondary_structure']
> 'HHHHHHEEEE"
>>>> shorter_seqrecord.per_symbol_annotations['hydrophobicity']
> [0.823, 0.880, 0.987, 0.461, 0.706, 0.972, 0.109, 0.499, 0.908, 0.045]
>
> Which I guess could be enforced in slice-handling by having it loop over the
> values (if any) in my_seqrecord.per_symbol_annotations and propagate
> accordingly.

This sounds like a possible consensus :)

In terms of names, we've have per_symbol_annotations and
per_letter_annotations (to match the existing annotations dictionary),
which are long but explicit.  We could also have letter_annotations,
symbol_annotations (shorter but more ambiguous), or even pas or pla
(too short?).

For the implementation, we could start with a simple dictionary and
see if any kind of safety feature should be added later if is seems
necessary.  What I had in mind was a dict subclass which takes the
sequence length, and by overriding the __setitem__ method checks only
python sequences (objects with __len__ and __getitem__) of the
appropriate length can be added.  This would add a small overhead when
creating the annotated SeqRecord, and wouldn't stop abuses like
my_seqrecord.per_symbol_annotations['secondary_structure'].append("X"),
but would make it harder to accidentally get inconsistent sequence and
per-letter-annotation.

> The more heavyweight idea involved a PerSymbolAnnotation (or somesuch name)
> class.  I imagined this presenting a common API, but permitting the storage
> of annotation data in an arbitrary fashion so long as it could be returned
> as a Python sequence.  The class-based approach would make it possible to
> attach methods specific to that kind of annotation data, which may be useful
> - but probably not in the vast majority of cases.  Also, any such operations
> could probably be handled external to the object by other functions, so long
> as they can get that Python sequence - which the more lightweight approach
> provides.

You could implement things like a SolexaQualityList and
PhredQualityList with methods to inter-convert the scores and still
use them within the per_letter_annotations approach described above.
One of the nice things about this dictionary approach is it would be
very flexible - you could also store an N by 3 numpy array containing
the x,y,z atomic coordinates of the C-alpha protein backbone for a
protein of length N, or a list of residue objects from our PDB parser.
 Anything which is a python sequence object (so lists, strings, tuples
for a start).

>> My vote is for bundling them together into a single row table using
>> json to stringify the lists. It's a nice compact representation and
>> will be well supported in any language. Python 2.6 has the
>> simplejson library bundled, so it's just a matter of doing:
>>
>> jsonified_list = json.dumps(the_quality_list)
>> the_quality_list = json.loads(jsonified_list)
>>
>> Since I've been doing more Javascript and Python, I appreciate not
>> munging lists into strings with obscure separators and really like
>> json. As a bonus, it looks just like Python.
>
> I don't like the idea of storing each per-symbol annotation (i.e. single
> score/annotation) in its own row, either.  I think that we all realise that
> approach could rapidly become hugely inefficient ;)  ...

For recording complex objects in a BioSQL database, using json sounds
like a simple cross language solution.  We should take this sub-topic
over to the BioSQL mailing list.  In terms of Biopython, we'd need to
be able to support old versions Python.  For simple cases like lists
of integers, or lists of floats, this is probably very straight
forward - but if we need full json support its a bit more tricky.
We'd want to use the BioSQL term/ontology features to indicate the
value is json encoded somehow.

Peter