[Biopython] Storing SeqRecord objects with annotation

Thu Jul 23 08:54:47 EDT 2009

On Thu, Jul 23, 2009 at 1:23 PM, Andrea<andrea at biodec.com> wrote:
>
> To be precise, I'm really testing code, my own code. My predictors are
> implemented in Python, and to be sure that over time (bug fixes,
> modifications...) I won't alter the prediction results, I built some
> unit tests to compare the results of the modified code with the results
> of the old code.
>
>Peter wrote:
>> If you have SeqFeatures and SeqRecords with simple string based
>> annotation, then BioSQL should be fine.
>
> In my opinion, for unit testing purposes, using BioSQL for storing data
> is quite expensive in terms of code (or so it seems...), even though
> BioSQL is certainly fine for storing my annotations and
> features.
>
>> If you have SeqFeatures, then using GenBank output might be
>> enough. There are no general fields in the GenBank format for
>> arbitrary annotation though.
>
> Yes, I think that GenBank won't store my "personal annotations"
> (or I have to check it).
>
>>> Actually I don't use per-letter-annotation, although it seems
>>> interesting. But I didn't find any example in the documentation (that
>>> shows how the dictionary is populated...) so I really don't know
>>> how to use it... even though, during prediction, I have a "per position
>>> annotation".
>>
>> You are right that the SeqRecord chapter in the Tutorial doesn't
>> explicitly cover populating the per-letter-annotation. I can fix that...

The next version of the Tutorial will include a short example of this.
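
As a rough sketch of what populating the dictionary looks like (the
sequence, the key names "predicted_label" and "reliability", and all the
values here are made up for illustration, they are not from a real
predictor):

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical six residue peptide with made-up prediction results.
record = SeqRecord(Seq("MKVLAT"), id="example1", description="")

# Each entry must be a sequence (string, list or tuple) with the same
# length as the record's sequence. The key names are our own choice.
record.letter_annotations["predicted_label"] = "HHHCCE"
record.letter_annotations["reliability"] = [0.9, 0.8, 0.8, 0.5, 0.4, 0.7]

# Slicing the record slices the per-letter-annotation along with it.
sub_record = record[1:4]
print(sub_record.seq)
print(sub_record.letter_annotations["predicted_label"])
print(sub_record.letter_annotations["reliability"])
```

Assigning a value whose length does not match the sequence is rejected,
which should catch most bookkeeping mistakes early.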

>> However, the built in documentation covers this (e.g. the section
>> on slicing a SeqRecord to get a sub-record):
>>
>> from Bio.SeqRecord import SeqRecord
>> help(SeqRecord)
>> ...
>>
>> You can read this online:
>> http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html
>
> Very interesting and easy to use. I can use it for:
>   - storing a per position string representing the "per position label"
>     of the prediction
>   - storing a list of per position reliabilities (reliability of prediction)
>   - storing sequence variants
>   - storing a possible aligned sequence
> But it's a pity that this is not yet managed in BioSQL ....

Some of those might be possible using SeqFeature objects,
but I agree, the "per letter annotation" seems more suitable.

> Also, if the "per letter annotation" is not managed in the GenBank
> format or in the BioSQL format (which I use a lot), I have to wait!!

Some special cases of "per letter annotation" are supported for
file output (PFAM/Stockholm alignments, FASTQ, and QUAL),
but that's it. The idea of the SeqRecord "per letter annotation"
was to be sufficiently general to cover these and other future
uses.
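
For instance, the FASTQ case might look something like this. It is only
a sketch: the read name and quality scores are invented, and it assumes
a Biopython recent enough that Seq objects no longer need an alphabet.
FASTQ output looks for the "phred_quality" key in the per-letter-annotation:

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up read with one PHRED quality score per base.
record = SeqRecord(Seq("ACGT"), id="read1", description="")
record.letter_annotations["phred_quality"] = [40, 30, 20, 10]

# Write the record as FASTQ to an in-memory handle instead of a file.
handle = StringIO()
SeqIO.write(record, handle, "fastq")
fastq_text = handle.getvalue()
print(fastq_text)

# Parsing the text again recovers the per-letter scores.
reloaded = SeqIO.read(StringIO(fastq_text), "fastq")
print(reloaded.letter_annotations["phred_quality"])
```

Any other keys in the per-letter-annotation dictionary are simply not
written, since the FASTQ format has nowhere to put them.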

>> Currently the BioSQL schema doesn't have any explicit support
>> for "per letter annotation", but we could encode it as a string
>> (e.g. using XML or JSON) perhaps. This will require coordination
>> with BioSQL, BioPerl etc - and thus far no one has expressed a
>> strong need for this.
>>
>> ...
>>
>> You can record any object in the SeqRecord's annotation
>> dictionary. However, saving the result to a file will be tricky -
>> and it wouldn't work in BioSQL either.
>
> I could say that I will use it, if it works in BioSQL... but until
> it is possible to store this information (BioSQL,
> GenBank...) I think the "per letter annotation" will lose part of its
> "charm"....

Currently BioSQL just stores strings for general annotation.
I think extending BioSQL to store simple per-letter-annotation
would be possible - for example strings, integers, and floating
point numbers. However, storing objects like a PSSM might
not be possible, as we would want this to be compatible
across the other Bio* bindings.
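
To make the string encoding idea a bit more concrete, here is a minimal
sketch using JSON from the Python standard library. The two helper
functions are hypothetical, and this is not an agreed BioSQL convention,
just one way it might look for strings, integers and floats:

```python
import json


def encode_letter_annotations(letter_annotations):
    """Turn a dict of per-letter values into one JSON string (hypothetical helper)."""
    return json.dumps(letter_annotations, sort_keys=True)


def decode_letter_annotations(text, sequence_length):
    """Rebuild the dict, checking each entry against the sequence length."""
    decoded = json.loads(text)
    for key, values in decoded.items():
        if len(values) != sequence_length:
            raise ValueError(
                "%r has %d values, expected %d" % (key, len(values), sequence_length)
            )
    return decoded


# Made-up per-letter data for a six letter sequence.
original = {
    "predicted_label": "HHHCCE",
    "reliability": [0.9, 0.8, 0.8, 0.5, 0.4, 0.7],
    "coverage": [3, 3, 4, 4, 2, 1],
}

as_string = encode_letter_annotations(original)
restored = decode_letter_annotations(as_string, 6)
print(as_string)
print(restored == original)
```

Note that JSON has no tuple type, so tuples would come back as lists, and
arbitrary Python objects cannot be encoded this way at all.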

Peter