[Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?

Tue Mar 22 15:30:46 UTC 2011

On Tue, Mar 22, 2011 at 3:08 PM, Uri Laserson <laserson at mit.edu> wrote:
>> You could stuff record level information into a source feature's
>> qualifier dictionary.
>
> What are the allowed types for the values of the qualifiers dictionary
> (that will be output correctly in INSDC)?  Is it possible to have lists of
> strings?

As far as the current Biopython output goes, you can basically use any
(short) string as a qualifier key. Avoid keys containing spaces (INSDC
uses underscores) or other unusual characters. For strict INSDC compliance
there is probably a white list of allowed feature types...
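
As a rough sketch of the above (not a definitive recipe), the following
stores record-level information in the qualifiers of a full-length source
feature and writes the record as GenBank. The key "lab_sample_id", the
record ID and the sequence are made up for illustration, and setting
"molecule_type" assumes a recent Biopython release. Qualifier values may be
a plain string or a list of strings; a list is written as one qualifier
line per value.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up record, for illustration only.
record = SeqRecord(
    Seq("ACGTACGTACGT"),
    id="EXAMPLE1",
    name="EXAMPLE1",
    description="made-up record",
)
# Assumption: recent Biopython releases need this for GenBank/EMBL output.
record.annotations["molecule_type"] = "DNA"

# Full-length "source" feature carrying the record-level information.
source = SeqFeature(
    FeatureLocation(0, len(record)),
    type="source",
    qualifiers={
        "lab_sample_id": "S42",  # hypothetical key, no spaces
        "note": ["first note", "second note"],  # one output line per value
    },
)
record.features.append(source)

handle = StringIO()
SeqIO.write(record, handle, "genbank")
genbank_text = handle.getvalue()
print(genbank_text)
```

A made-up key such as "lab_sample_id" is written out by Biopython, but
stricter INSDC tools may reject it.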

> What is the standard practice: a feature of type "source" that runs the
> entire length of the sequence?  Or is it possible to have a SeqFeature with
> no position annotation?  Ideally, if I slice the SeqFeature, I would like
> these annotations to stay with the slice no matter what.

If you did have a SeqFeature without a location, we couldn't write
it out in GenBank/EMBL format (the error handling here might be
improved).

If you have a SeqRecord with a (source) feature spanning the full
sequence, and you slice the SeqRecord to take a subsequence,
then that full-length feature (and any other feature not fully within
the subsequence) would be lost.
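
To illustrate, here is a small sketch with a made-up record. The helper
slice_keeping_source is hypothetical (it is not part of Biopython) and
assumes simple non-negative start/end values with no step; it just copies
the source qualifiers onto a new full-length feature of the slice.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up record with a full-length source feature and a short gene feature.
record = SeqRecord(Seq("ACGTACGTACGT"), id="EXAMPLE1")
record.features = [
    SeqFeature(
        FeatureLocation(0, len(record)),
        type="source",
        qualifiers={"lab_sample_id": ["S42"]},  # hypothetical key
    ),
    SeqFeature(FeatureLocation(2, 6), type="gene"),
]

# Plain slicing keeps only features lying fully within the slice.
sub = record[1:8]
kept_types = [feature.type for feature in sub.features]
print(kept_types)  # the source feature is gone, only the gene remains


def slice_keeping_source(rec, start, end):
    """Hypothetical helper: slice rec, then re-attach source qualifiers."""
    piece = rec[start:end]
    for feature in rec.features:
        if feature.type == "source":
            new_source = SeqFeature(
                FeatureLocation(0, len(piece)),
                type="source",
                qualifiers=dict(feature.qualifiers),
            )
            piece.features.insert(0, new_source)
    return piece


piece = slice_keeping_source(record, 1, 8)
print([feature.type for feature in piece.features])
```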

Using a source feature is really just a workaround for the fact that
GenBank/EMBL do not support arbitrary record-level annotation.
Do you have to use this as your output format? Would you not be
better off using a database or something else instead?

Peter