[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Fri Feb 20 14:15:50 UTC 2009

Another 2p... I collect them, you know...

An additional determinant of how these values are best stored is: "What will
they be used for?"

If their only use were to accompany a sequence so that its file format could
be converted from one with embedded qualities to one that requires separate
sequence and quality files (or vice versa), then straightforward storage
as a string in a dictionary is all that's needed.  This would also be
sufficient for conversion between some quality score formats, as a utility
function could just grab the stored string (given an appropriate name for
each quality format).  How these per-symbol annotations should be modified
when returning a Seq slice or join may still be an issue.
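
As a minimal sketch of the storage-only case (the dictionary keys and the
helper name are illustrative assumptions, not existing Biopython names), a
quality string can sit in a plain dictionary and be re-encoded by a small
utility function.  The example only shifts PHRED scores between ASCII offsets
33 and 64; it would not be valid for Solexa-scaled scores, which are not a
simple offset of PHRED.

```python
SANGER_OFFSET = 33   # PHRED score q stored as chr(q + 33)
OFFSET_64 = 64       # PHRED score q stored as chr(q + 64)


def reencode_quality(qual_string, from_offset, to_offset):
    """Shift every character of an ASCII-encoded quality string to a new offset."""
    return "".join(chr(ord(c) - from_offset + to_offset) for c in qual_string)


# Hypothetical annotation keys, one per quality format.
annotations = {"quality_offset33": "II5+!"}
annotations["quality_offset64"] = reencode_quality(
    annotations["quality_offset33"], SANGER_OFFSET, OFFSET_64
)
```

Nothing here keeps the stored string in step with the sequence: slicing the
sequence leaves the string untouched, which is the slice/join issue in
question.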

If 'live' access to the values is required for calculation or alignment
purposes, then a different interface might be more useful, permitting
slicing, base selection on the basis of quality, or other operations.  This
use case is more complex, as the return value is likely to be dependent on
the quality format (single- or multiple-value per base).
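
To illustrate what 'live' access might look like, here is a rough sketch of
base selection on the basis of quality.  It assumes the simplest case of one
integer score per base; the function name and the choice of "N" as the mask
character are my own assumptions, and a multiple-value format (e.g. four
intensities per base) would need a different rule.

```python
def mask_low_quality(sequence, qualities, threshold, mask_char="N"):
    """Return the sequence with bases scoring below threshold replaced."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    return "".join(
        base if score >= threshold else mask_char
        for base, score in zip(sequence, qualities)
    )


# One score per base; positions 2 and 3 fall below the threshold of 25.
masked = mask_low_quality("ACGTAC", [40, 38, 20, 10, 30, 35], threshold=25)
```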

Conceptually, I see quality scores as annotations of a sequence, rather than
an intrinsic property of the sequence, so am happy for them to live in the
same place other annotations do.  I also see them as only one instance of a
class of per-symbol annotations (along with hydrophobicity scores, secondary
structure predictions, read map counts and several other measures).  I
think, therefore, that there is a case for a class describing per-symbol
annotations to a Seq, and placing these in a dictionary of per-symbol
annotations.  Slices of the parent Seq could then be propagated downwards to
all members of that dictionary (which would also be expected to implement
the same string-like methods as the parent).
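
A minimal sketch of that idea follows.  Both class names and the attribute
names are hypothetical (this is not the SeqRecord API), and a plain str stands
in for Seq to keep the example self-contained.  The point is only that a slice
of the parent is passed down to every member of the per-symbol dictionary.

```python
class PerSymbolAnnotation:
    """One value per symbol of the parent sequence (hypothetical class)."""

    def __init__(self, values, kind):
        self.values = list(values)
        self.kind = kind  # descriptive label, e.g. "phred_quality"

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice yields a new annotation of the same kind.
            return PerSymbolAnnotation(self.values[index], self.kind)
        return self.values[index]


class AnnotatedSequence:
    """A sequence plus a dictionary of per-symbol annotations (hypothetical)."""

    def __init__(self, seq, per_symbol=None):
        self.seq = seq
        self.per_symbol = dict(per_symbol or {})
        for name, annotation in self.per_symbol.items():
            if len(annotation) != len(seq):
                raise ValueError("annotation %r has the wrong length" % name)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Propagate the slice to every per-symbol annotation.
            sliced = {name: ann[index] for name, ann in self.per_symbol.items()}
            return AnnotatedSequence(self.seq[index], sliced)
        return self.seq[index]


record = AnnotatedSequence(
    "ACGTAC",
    {
        "phred": PerSymbolAnnotation([40, 38, 20, 10, 30, 35], "phred_quality"),
        "ss": PerSymbolAnnotation("HHHEEC", "secondary_structure"),
    },
)
sub = record[1:4]
```

Here sub.seq and every entry in sub.per_symbol cover the same three positions,
so the annotations cannot drift out of step with the sequence.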

The per-symbol annotation objects could be subclassed and/or contain a
descriptive string from a controlled vocabulary to indicate their format,
for standard interfacing with external packages (e.g. Drawing TOPS diagrams
from secondary structure predictions or rendering base quality profiles),
which I think would be a flexible approach.
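
One way the descriptive string could drive such interfacing is a lookup from
format name to handler, sketched below.  The format names, the registry and
both toy "renderers" are assumptions made up for illustration; a real
controlled vocabulary and real drawing code would replace them.

```python
from itertools import groupby

RENDERERS = {}  # maps a format name from the vocabulary to a handler


def renderer(format_name):
    """Decorator registering a handler for one annotation format."""
    def register(func):
        RENDERERS[format_name] = func
        return func
    return register


@renderer("phred_quality")
def quality_bars(values):
    """Crude text profile: one '#' per ten quality units."""
    return ["#" * (value // 10) for value in values]


@renderer("secondary_structure")
def structure_segments(values):
    """Collapse per-residue states into (state, run length) segments."""
    return [(state, len(list(run))) for state, run in groupby(values)]


def render(format_name, values):
    """Dispatch on the descriptive string carried by the annotation."""
    try:
        handler = RENDERERS[format_name]
    except KeyError:
        raise ValueError("no renderer for format %r" % format_name)
    return handler(values)


bars = render("phred_quality", [40, 20, 5])
segments = render("secondary_structure", "HHHEEC")
```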

On 20/02/2009 11:49, "Jose Blanca" <jblanca at btc.upv.es> wrote:

>> I suppose you could consider adding a .phred_quality
>> property which is explicit, but then you'd end up with many different
>> properties.  Then there are other per-letter quality annotations - you
>> might want the A, C, G and T intensity from capillary sequencing (four
>> sets of numbers, not just one).  Plus of course this doesn't address
>> non-quality related per-letter-annotations (like secondary structure,
>> or atomic coordinates).
>> 
>> My point is that we can't give top level properties to everything,
>> hence the original introduction of the annotations dictionary in the
>> first place.  Only a handful of really important things got their own
>> properties (id, name, description and the sequence itself).  If there
>> was only ONE key quality score, then I wouldn't mind making an
>> exception so much - but that doesn't seem to be the case.
> That's a very good point. It wouldn't be wise to populate the SeqRecord class
> with a lot of properties.
> Another possible approach would be to create a derived class for that, a
> SeqWithQuality. It would be like a SeqRecord but with a .quality property.
> For other cases other classes could be derived from SeqRecord.
> The problem with putting the qualities in a dict with all the other per-base
> annotations is that it has a different behaviour than the .seq case. The seq
> case is special because it is much more used, so maybe that's fair enough.
> I don't know, maybe it is wiser to put all the per-base annotations in a dict
> and leave the sequence outside. That way we won't be creating a lot of new
> classes derived from SeqRecord.
> The more I think about the dict possibility, the more I like it.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
