[Biopython] Storing SeqRecord objects with annotation

Thu Jul 23 09:32:39 EDT 2009

Hi Hilmar!

I've CC'd this to the BioSQL list. The start of the thread was here:
http://lists.open-bio.org/pipermail/biopython/2009-July/005385.html

On Thu, Jul 23, 2009 at 2:01 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On Jul 23, 2009, at 6:20 AM, Peter wrote:
>
>> Currently the BioSQL schema doesn't have any explicit support
>> for "per letter annotation"
>
> I haven't been following the thread closely and so may be missing what is
> really meant by this. If, however, you mean associating annotation to a
> specific letter (position) in the sequence, BioSQL does support this - you'd
> create a seqfeature with appropriate location, and attach the annotation to
> the seqfeature.
>
> Bioentry annotations are location-less, by comparison.

By "per letter annotation" we mean essentially a list of annotation
data, with one entry for each letter in the sequence. For example,
a sequencing quality score (from a FASTQ file) where there is one
integer per letter (i.e. per base). Or, a secondary structure
prediction, encoded as one character per letter (which could
apply to proteins and nucleotides).
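
As a rough illustration of what I mean, here is a minimal sketch in
plain Python (standard library only, not actual Biopython or BioSQL
code). The sequence, the annotation names and the helper function
check_per_letter are all made up for this example; the only point is
that every annotation has exactly one entry per letter.

```python
# Hypothetical example data: one annotation entry per letter of the sequence.
sequence = "ACGTTGCA"
per_letter = {
    # One integer per base, e.g. quality scores parsed from a FASTQ file.
    "phred_quality": [30, 32, 28, 40, 40, 35, 22, 18],
    # One character per letter, e.g. a secondary structure prediction.
    "secondary_structure": "..((..))",
}


def check_per_letter(sequence, per_letter):
    """Raise ValueError unless every annotation matches the sequence length."""
    for name, values in per_letter.items():
        if len(values) != len(sequence):
            raise ValueError(
                "%s has %d entries, expected %d"
                % (name, len(values), len(sequence))
            )
    return True


check_per_letter(sequence, per_letter)
```

Both a list and a string work here, since all that matters is that the
annotation can be indexed by position in the sequence.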

This sort of thing could be done by using one feature per letter,
but it would be dreadfully inefficient for storing in the database.
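
To make the cost concrete, here is a small sketch (plain Python again;
the dictionaries are hypothetical stand-ins for feature records, not
the real BioSQL tables). One feature per letter means one record per
position, whereas the same data fits in a single text value.

```python
def per_letter_as_features(values):
    """Expand per-letter values into one feature-like record per position."""
    return [
        {"start": i, "end": i + 1, "value": v}
        for i, v in enumerate(values)
    ]


# Hypothetical quality scores for an 8 letter sequence.
qualities = [30, 32, 28, 40, 40, 35, 22, 18]

# One record for every letter in the sequence.
feature_rows = per_letter_as_features(qualities)

# The same information packed into a single string.
compact_value = " ".join(str(q) for q in qualities)
```

The number of records grows with the sequence length, so a long
sequence would need one feature, with its own location, for every
single letter.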

>> [...]
>> You can record any object in the SeqRecord's annotation dictionary.
>> However, saving the result to a file will be tricky - and it wouldn't
>> work in BioSQL either.
>
> Note that that's not entirely true. If you have a textual serialization
> (such as XML) of your object, you *can* store it in
> bioentry_qualifier_value. This is what we do in BioPerl with a TagTree
> annotation object that supports a nested hierarchical annotation
> structure needed for lossless representation of some UniProt lines.

This was what I mentioned earlier in the thread - using XML or
JSON to turn the object into a long string. However, we really need
the Bio* projects to agree on some standards here, rather than
each project making its own ad hoc additions (which will make
interoperability much trickier). For example, I was unaware you
(BioPerl) had already pressed ahead with this for the UniProt
data - which rather proves my point.
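
For the record, the sort of thing I have in mind looks like this
minimal sketch using Python's standard json module. The nested
annotation and the function names are invented for illustration; this
is not an agreed Bio* convention, nor what BioPerl does with TagTree.
The resulting string is what would be stored as a single qualifier
value.

```python
import json


def annotation_to_text(annotation):
    """Serialize a nested annotation object to a single JSON string."""
    # sort_keys gives a stable string, which helps when comparing values.
    return json.dumps(annotation, sort_keys=True)


def text_to_annotation(text):
    """Rebuild the nested annotation object from its JSON string."""
    return json.loads(text)


# Hypothetical nested annotation, made up for this example.
annotation = {
    "comment": {
        "topic": "FUNCTION",
        "text": "Hypothetical nested entry",
        "evidence": ["example-1", "example-2"],
    }
}

stored_text = annotation_to_text(annotation)
restored = text_to_annotation(stored_text)
```

This only round trips cleanly for what JSON can represent (dicts with
string keys, lists, strings, numbers, booleans and None), which is one
more reason the projects would need to agree on what is allowed.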

> Obviously, that won't allow you to query very well by individual
> elements of your custom annotation object. But you can build a
> custom index (e.g., using Lucene) that does that.

Yes, doing searches on an XML/JSON encoded string is an issue.
But right now we are probably more interested in just solving the
persistence of more complex objects.

Peter