[BioSQL-l] Storing "per letter" annotation?

Sun May 25 00:16:31 UTC 2008

Comments below.

On May 24, 2008, at 8:21 AM, Peter wrote:
> This is a BioSQL related query - but first a little background:
>
> One topic that has recently come up on the Biopython developers
> mailing list is extending our sequence classes to deal with "per
> letter" annotation.  This annotation should then survive splicing the
> sequence into sub-strings for example.
>
> For example, with nucleotide sequences, each base-pair may have an
> associated quality score (one float per bp).  Or perhaps you might
> have a contig region where for each bp you want to record the number
> of fragments it is supported by (one integer per bp).
>
> Similarly, for proteins, you might know the secondary structure (for
> example held as a character per amino acid, a = alpha helix etc).  For
> a PDB file, you might want to have an object for each residue holding
> an associated set of atomic coordinates, or maybe just the C-alpha
> backbone coordinates (three floats per residue).  One final motivating
> example, you might want to hold the solvent accessibility of each
> residue (one float per residue).
>
> First of all, have any of the other Bio* projects implemented anything
> like this?  If so, I'd like to have a look at the relevant
> documentation (and depending on the language, even the
> implementation).  And secondly, how would you go about storing it in
> BioSQL?  As far as I can see, there isn't anything in BioSQL at the
> moment suitable (other than abusing the sequence features).

It sounds like in essence you want to store alternative sequences in  
other alphabets for a sequence?

In BioPerl we have Bio::Seq::SeqWithQuality and the more generic  
Bio::Seq::MetaI. However, BioSQL in v1.0 really only supports a 1-1  
relationship between Bioentry and Biosequence, i.e., a bioentry can  
only have a single sequence, and hence additional sequences (quality  
values, secondary structure, etc) would need to be stored as a flat  
annotation value, or through the biosequence of a second bioentry  
that is linked to the first through a bioentry_relationship.  
(Biosequence in principle allows any alphabet.)
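
To make the second workaround concrete, here is a minimal sketch in
Python using sqlite3. The tables are a heavily cut-down stand-in for the
BioSQL schema (only the columns needed here, not the real DDL), and the
entry names, the term name "secondary_structure", the direction of the
relationship, and the helper functions are all made up for illustration.

```python
import sqlite3

# Cut-down stand-in for the BioSQL tables involved; NOT the real DDL.
# biosequence.bioentry_id is the primary key, which is what limits a
# bioentry to a single sequence in this sketch.
SCHEMA = """
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE biosequence (
    bioentry_id INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
    alphabet TEXT,
    seq TEXT
);
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE bioentry_relationship (
    object_bioentry_id INTEGER REFERENCES bioentry(bioentry_id),
    subject_bioentry_id INTEGER REFERENCES bioentry(bioentry_id),
    term_id INTEGER REFERENCES term(term_id)
);
"""


def add_entry(conn, name, alphabet, seq):
    """Hypothetical helper: create a bioentry with its one biosequence."""
    cur = conn.execute("INSERT INTO bioentry (name) VALUES (?)", (name,))
    bioentry_id = cur.lastrowid
    conn.execute(
        "INSERT INTO biosequence (bioentry_id, alphabet, seq) VALUES (?, ?, ?)",
        (bioentry_id, alphabet, seq),
    )
    return bioentry_id


def link(conn, object_id, subject_id, term_name):
    """Hypothetical helper: relate the annotation entry to the main entry."""
    cur = conn.execute("INSERT INTO term (name) VALUES (?)", (term_name,))
    conn.execute(
        "INSERT INTO bioentry_relationship "
        "(object_bioentry_id, subject_bioentry_id, term_id) VALUES (?, ?, ?)",
        (object_id, subject_id, cur.lastrowid),
    )


def fetch_annotation(conn, entry_name, term_name):
    """Return the per-letter string linked to entry_name, or None."""
    row = conn.execute(
        """
        SELECT s.seq
        FROM bioentry o
        JOIN bioentry_relationship r ON r.object_bioentry_id = o.bioentry_id
        JOIN term t ON t.term_id = r.term_id
        JOIN biosequence s ON s.bioentry_id = r.subject_bioentry_id
        WHERE o.name = ? AND t.name = ?
        """,
        (entry_name, term_name),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# The real sequence and its per-letter annotation live in two bioentries.
protein_id = add_entry(conn, "prot1", "protein", "MKTAYIAK")
ss_id = add_entry(conn, "prot1_ss", "secondary_structure", "cchhhhcc")
link(conn, protein_id, ss_id, "secondary_structure")

ss = fetch_annotation(conn, "prot1", "secondary_structure")
```

Note that nothing in the database ties the length of the annotation
string to the length of the sequence it annotates, and every lookup
costs three joins, which is part of why this is a workaround rather
than a solution.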

Neither of those kludges (in fact, quite bad hacks) seems particularly  
attractive, so this would actually be another use case in favor of  
relaxing the 1-1 cardinality constraint to one that's 1-n. (Feel free  
to add this to the roadmap on the wiki.)

As you say, you could indeed do this using seqfeatures too, but  
that'd be an abuse.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================