[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?
chapmanb at 50mail.com
Sun Feb 22 21:27:42 UTC 2009
> I was actually thinking about adding a per_letter_annotations (or
> using Brad's suggested name per_symbol_annotations) dictionary which
> could hold phred qualities, solexa qualities, secondary structure,
> atomic coordinates - any python sequence (e.g. string, list or tuple)
> with a length matching the sequence. This would cover all the use
> cases I have come up with, and we can implement SeqRecord slicing
> which would also slice everything in the per_letter_annotations
> I'm not sure if it's exactly what Leighton has in mind, but it seems
> more complicated to have to do
> my_record.per_symbol_annotations["quality"]["phred"] rather than just
I agree with you here -- the double dictionary I proposed is ugly
and doesn't add anything useful. I'm +1 on exactly what you wrote
here, and am not picky about the naming.
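
To make the single-level dictionary and the slicing behaviour concrete, here
is a minimal sketch. It assumes a hypothetical AnnotatedRecord class that
stands in for SeqRecord; the class name and the example keys are illustrative
only and are not the actual Biopython implementation.

```python
class AnnotatedRecord:
    """Hypothetical stand-in for a SeqRecord with per-letter annotations."""

    def __init__(self, seq, per_letter_annotations=None):
        self.seq = seq
        self.per_letter_annotations = {}
        for key, values in (per_letter_annotations or {}).items():
            # Every annotation must be a sequence as long as the record itself.
            if len(values) != len(seq):
                raise ValueError(
                    "annotation %r has length %d, expected %d"
                    % (key, len(values), len(seq))
                )
            self.per_letter_annotations[key] = values

    def __getitem__(self, index):
        # This sketch only handles slices, not single-letter indexing.
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slicing")
        # Slice each annotation with the same slice as the sequence.
        sliced = dict(
            (key, values[index])
            for key, values in self.per_letter_annotations.items()
        )
        return AnnotatedRecord(self.seq[index], sliced)


record = AnnotatedRecord(
    "ACGTAC",
    {"phred": [10, 20, 50, 40, 30, 20], "secondary_structure": "HHHEEE"},
)
sub = record[1:4]
```

With this layout a quality lookup is just sub.per_letter_annotations["phred"],
and strings, lists and tuples all slice the same way.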
> The only catch is the current tables only let us store
> strings. We could store each per-letter-annotation entry (e.g. a
> single quality score) as a separate table entry (where the rank tells
> us the correct order), but bundling them all into a single long table
> row might be more efficient. In the case of PHRED or Solexa scores,
> we could even use the FASTQ encoding (but a string "10, 20, 50, ..."
> might be more sensible). This would require some co-ordination with
> the other Bio* projects, probably on the BioSQL mailing list.
My vote is for bundling them together into a single table row, using
json to stringify the lists. It's a nice compact representation and
is well supported in most languages. Python 2.6 bundles simplejson in
the standard library as the json module, so it's just a matter of doing:
    import json
    jsonified_list = json.dumps(the_quality_list)
    the_quality_list = json.loads(jsonified_list)
I'd rather avoid munging lists into strings with obscure separators,
and I really like json. As a bonus, it looks just like Python.
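
As a rough illustration of the two string encodings discussed above, the
sketch below round-trips a list of PHRED scores through JSON and, for
comparison, through a FASTQ-style encoding. The variable names and the score
values are made up, and the FASTQ part assumes the Sanger convention of an
ASCII offset of 33; how the resulting string would actually be stored in a
BioSQL table is not shown.

```python
import json

# Made-up PHRED scores for a five-letter sequence.
phred_scores = [10, 20, 50, 40, 30]

# JSON: the whole list becomes one string that fits in a single table row.
stored_value = json.dumps(phred_scores)
restored = json.loads(stored_value)

# FASTQ-style alternative (assumed Sanger offset of 33): one character per score.
fastq_value = "".join(chr(score + 33) for score in phred_scores)
decoded = [ord(char) - 33 for char in fastq_value]
```

The JSON string is longer than the FASTQ one, but it is readable as-is and
works unchanged for annotations that are not small integers.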