[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Mon Feb 23 04:48:07 EST 2009

Hi all,

On 22/02/2009 21:27, "Brad Chapman" <chapmanb at 50mail.com> wrote:

> [...]
>> I'm not sure if it's exactly what Leighton has in mind, but it seems
>> more complicated to have to do
>> my_record.per_symbol_annotations["quality"]["phred"] rather than just
>> my_record.per_symbol_annotations["quality_phred"].
> 
> I'm agreed with you here -- the double dictionary I proposed is ugly
> and doesn't do much of anything extra. I'm +1 on exactly what you wrote
> here, and am not picky about the naming.

I was originally suggesting two extremes, a lightweight dictionary and a
more heavyweight new class.  I now prefer the lightweight option, which I
imagine might operate along the lines of (keeping away from quality scores,
for now...)

>>> my_seqrecord
SeqRecord(seq=Seq('FCLEPPYWYKNPGARTESRILRGGIID', Alphabet()),
id='my_seqrecord', name='<unknown name>', description='<unknown
description>', dbxrefs=[])
>>> my_seqrecord.per_symbol_annotations['secondary_structure']
'HHHHHHEEEEEEE     EEEEEEEEE'
>>> my_seqrecord.per_symbol_annotations['hydrophobicity']
[0.823, 0.880, 0.987, 0.461, 0.706, 0.972, 0.109, 0.499, 0.908, 0.045,
0.493, 0.162, 0.796, 0.989, 0.419, 0.501, 0.686, 0.985, 0.502, 0.242, 0.890,
0.436, 0.855, 0.426, 0.814, 0.178, 0.923]
>>> # Assuming that one day there's slicing of SeqRecords...
>>> shorter_seqrecord = my_seqrecord[:10]
>>> shorter_seqrecord.per_symbol_annotations['secondary_structure']
'HHHHHHEEEE'
>>> shorter_seqrecord.per_symbol_annotations['hydrophobicity']
[0.823, 0.880, 0.987, 0.461, 0.706, 0.972, 0.109, 0.499, 0.908, 0.045]

I guess this could be enforced in slice-handling by having it loop over the
values (if any) in my_seqrecord.per_symbol_annotations and apply the same
slice to each of them.
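
As a rough sketch of that slice-handling loop (AnnotatedRecord is a
hypothetical stand-in, not the real SeqRecord, and it assumes that every
annotation value is itself sliceable and has one entry per symbol):

```python
class AnnotatedRecord:
    """Hypothetical stand-in for a SeqRecord with per-symbol annotations."""

    def __init__(self, seq, per_symbol_annotations=None):
        self.seq = seq
        self.per_symbol_annotations = {}
        for key, value in (per_symbol_annotations or {}).items():
            # Each annotation must supply exactly one entry per symbol.
            if len(value) != len(seq):
                raise ValueError(
                    "annotation %r has length %d, expected %d"
                    % (key, len(value), len(seq))
                )
            self.per_symbol_annotations[key] = value

    def __getitem__(self, index):
        if not isinstance(index, slice):
            # Integer indexing just returns the symbol itself.
            return self.seq[index]
        # Apply the same slice to the sequence and to every annotation.
        sliced = {
            key: value[index]
            for key, value in self.per_symbol_annotations.items()
        }
        return AnnotatedRecord(self.seq[index], sliced)


record = AnnotatedRecord(
    "FCLEPPYWYK",
    {
        "secondary_structure": "HHHHHHEEEE",
        "hydrophobicity": [0.823, 0.880, 0.987, 0.461, 0.706,
                           0.972, 0.109, 0.499, 0.908, 0.045],
    },
)
shorter = record[:4]
print(shorter.per_symbol_annotations["secondary_structure"])
print(shorter.per_symbol_annotations["hydrophobicity"])
```

Checking lengths when the record is built means that a slice can never
leave an annotation out of step with the sequence.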

The more heavyweight idea involved a PerSymbolAnnotation (or somesuch name)
class.  I imagined this presenting a common API, but permitting the storage
of annotation data in an arbitrary fashion so long as it could be returned
as a Python sequence.  The class-based approach would make it possible to
attach methods specific to that kind of annotation data, which may be useful
- but probably not in the vast majority of cases.  Also, any such operations
could probably be handled external to the object by other functions, so long
as they can get that Python sequence - which the more lightweight approach
provides.
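
To illustrate the heavyweight idea, here is a sketch of one such class.
RunLengthAnnotation, to_sequence() and fraction() are all names invented
for this example, and run-length storage is just one arbitrary choice of
internal representation:

```python
class RunLengthAnnotation:
    """Hypothetical per-symbol annotation stored as [value, run_length] pairs."""

    def __init__(self, values):
        self._runs = []
        for value in values:
            if self._runs and self._runs[-1][0] == value:
                # Same value as the previous symbol: extend the current run.
                self._runs[-1][1] += 1
            else:
                self._runs.append([value, 1])

    def __len__(self):
        return sum(count for _, count in self._runs)

    def to_sequence(self):
        """Expand the compact storage into a plain Python list."""
        expanded = []
        for value, count in self._runs:
            expanded.extend([value] * count)
        return expanded

    def __getitem__(self, index):
        # Simple rather than efficient: expand, then index or slice.
        return self.to_sequence()[index]

    def fraction(self, value):
        """Annotation-specific method: fraction of symbols with this value."""
        total = len(self)
        if total == 0:
            return 0.0
        matching = sum(count for v, count in self._runs if v == value)
        return float(matching) / total


ss = RunLengthAnnotation("HHHHHHEEEE")
print(len(ss))
print(ss[:3])
print(ss.fraction("H"))
```

Anything that only needs the values can call to_sequence() and ignore the
class entirely, which is the same thing the lightweight dictionary would
hand over directly.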

Most people's attention here seems to be focused on sequence quality data,
with a skew towards high-throughput sequencing, and for that use case the
lightweight approach is the one that makes most sense to me.

>> The only catch is the current tables only let us store
>> strings.  We could store each per-letter-annotation entry (e.g. a
>> single quality score) as a separate table entry (where the rank tells
>> us the correct order), but bundling them all into a single long table
>> row might be more efficient.  In the case of PHRED or Solexa scores,
>> we could even use the FASTQ encoding (but a string "10, 20, 50, ..."
>> might be more sensible).  This would require some co-ordination with
>> the other Bio* projects, probably on the BioSQL mailing list.
> 
> My vote is for bundling them together into a single row table using
> json to stringify the lists. It's a nice compact representation and
> will be well supported in any language. Python 2.6 has the
> simplejson library bundled (as the json module), so it's just a matter
> of doing:
> 
> jsonified_list = json.dumps(the_quality_list)
> the_quality_list = json.loads(jsonified_list)
> 
> Since I've been doing more Javascript and Python, I appreciate not
> munging lists into strings with obscure separators and really like
> json. As a bonus, it looks just like Python.

I don't like the idea of storing each per-symbol annotation (i.e. single
score/annotation) in its own row, either.  I think that we all realise that
approach could rapidly become hugely inefficient ;)  I can see that pulling
out individual symbol annotations might be desirable when people want slices
of the annotation in units smaller than a single seqfeature or bioentry (in
BioSQL terms). In those cases, on grounds of efficiency, I think it possibly
makes more sense to grab either the seqfeature or bioentry (since the
per-symbol annotations would always be associated with such an object) as a
SeqRecord and slice out the data, rather than to query a table with what
would likely be (at least eventually) millions of rows of per-symbol
annotations.  That possibly means adding slicing to SeqRecords though, which
brings its own problems... ;)

Storage of per-symbol annotation as Python sequence information in a single
db row, in a human-readable plain-text format that's readily parsable when
querying the database with Biopython, looks like a winning approach to me.
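
A sketch of how that single-row storage might look, using JSON as the
plain-text format and an in-memory SQLite database. The table and column
names are made up for illustration and are not part of the BioSQL schema:

```python
import json
import sqlite3

# Hypothetical single-table schema for illustration only; not BioSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE per_symbol_annotation (record_id TEXT, name TEXT, value TEXT)"
)

# Store the whole list of scores as one JSON string in one row.
quality = [10, 20, 50, 40, 30]
conn.execute(
    "INSERT INTO per_symbol_annotation VALUES (?, ?, ?)",
    ("my_seqrecord", "quality_phred", json.dumps(quality)),
)

# Fetch the row, decode it back into a list, and slice in Python
# rather than querying per-symbol rows.
row = conn.execute(
    "SELECT value FROM per_symbol_annotation WHERE record_id = ? AND name = ?",
    ("my_seqrecord", "quality_phred"),
).fetchone()
stored_text = row[0]
restored = json.loads(stored_text)
first_three = restored[:3]
print(stored_text)
print(first_three)
```

The stored value stays readable in a plain SQL client, and taking a
sub-range only costs one row fetch plus a Python slice.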

I'd not come across json before - it does remind me of nested Python
dictionaries.  It looks simple to use and parse, and reverse-engineerable if
necessary.  If it's robust to the kind of data we want to store, and a de
facto or actual standard usable transparently across all Bio* projects, then
it sounds like a good candidate, to me.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
