[Biopython] Storing SeqRecord objects with annotation

Thu Jul 23 10:20:11 UTC 2009

Hi Andrea (and everyone else),

This is a continuation of a discussion started on Bug 2883. Andrea had
a problem with unpickling SeqRecord objects which were pickled using
an older version of Biopython. She was using pickle to store complicated
annotated SeqRecord objects on disk.

See http://bugzilla.open-bio.org/show_bug.cgi?id=2883 for details.

http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c6
On Bug 2883 comment 6, Peter wrote:
>>
>> If your SeqRecord objects are all simply loaded from sequence files in
>> the first place (and not modified), I would just keep the original file and
>> re-parse it.
>>
>> If you have generated your own SeqRecords (or modified those from
>> reading a file), then it makes sense to save them somehow. The choice
>> of file format depends on the nature of annotation. The latest Biopython
>> will now record the features in a GenBank file, making that a reasonable
>> choice - but this does not cover per-letter-annotations. BioSQL has the
>> same limitation.

http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c7
On Bug 2883 comment 7, Andrea wrote:
>
> yes, i'm testing some predictors. I do prediction and i compare the
> "newly predicted seqrecords" with the "previously correct predicted
> pickled seqrecords".

Sorry - when you said "test code" on the Bug discussion, I though you
meant you were testing the code - not that this was real work doing
biological tests.

> I've them (the correct ones) only in pickled seqrecord format. The
> correctly predicted seqrecord, before prediction were in fasta format,
> but after i parsed them (into seqrecord), i did prediction, and then
> i pickled them (during prediction i add to seqrecord features and
> annotations).

If you have SeqFeatures and SeqRecords with simple string based
annotation, then BioSQL should be fine.

If you have SeqFeatures, then using GenBank output might be enough.
There are no general fields in the GenBank format for arbitary
annotation though.

> Actually i don't use per-letter-annotation despite the fact it seems
> interesting. But i didn't find any example in documentation (that
> show how the dictionary is populated...) so i really don't know
> how to use it.... even if i've, during prediction, a "per position
> annotation".

You are right that the SeqRecord chapter in the Tutorial doesn't
explicitly cover populating the per-letter-annotation. I can fix that...

However, the built in documentation covers this (e.g. the section
on slicing a SeqRecord to get a sub-record):

>>> from Bio.SeqRecord import SeqRecord
>>> help(SeqRecord)
...

You can read this online:
http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html

> Also if the "per letter annotation" is not managed in the GenBank
> format or in the BioSQL format (that i use a lot) i've to wait!!

Currently the BioSQL schema doesn't have any explicit support
for "per letter annotation", but we could encode it as a string
(e.g. using XML or JSON) perhaps. This will require coordination
with BioSQL, BioPerl etc - and thus far no one has expressed a
strong need for this.

The GenBank file format simply doesn't have an concept of "per
letter annotation". The PFAM/Stockholm alignment format does
(for the special case of a single character per letter of the
sequence), and in sequencing the base quality is also held in
some file formats.

> I was thinking also to store the pssm information somewhere in the
> seqrecord.... but this would be a very big change... (and also
> manage to store it in BioSQL.... )... but it's better to stop
> the discussion here or to move it... :-)

You can record any object in the SeqRecord's annotation dictionary.
However, saving the result to a file will be tricky - and it wouldn't
work in BioSQL either.

Peter