[Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?
Peter Cock
p.j.a.cock at googlemail.com
Tue Mar 22 11:30:46 EDT 2011
On Tue, Mar 22, 2011 at 3:08 PM, Uri Laserson <laserson at mit.edu> wrote:
>> You could stuff record level information into a source feature's
>> qualifier dictionary.
>
> What are the allowed types for the values of the qualifiers dictionary
> (that will be output correctly in INSDC)? Is it possible to have lists of
> strings?
As far as the current Biopython output goes, you can basically use any
(short) string as a qualifier key. Avoid keys containing spaces (INSDC
uses underscores) or other unusual characters. For strict INSDC compliance
there is probably a whitelist of allowed feature types...
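As a rough illustration of the suggestion above, here is a minimal sketch that
stores record-level information in a source feature's qualifiers and writes it
out as GenBank. It assumes a recent Biopython release (1.78 or later, where the
"molecule_type" annotation is needed for GenBank output); the record id, the
sequence and the "note" values are made up for the example. Each qualifier
value is given as a list of strings, which Biopython writes out as one
qualifier line per item.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical record; "molecule_type" is assumed to be required by the
# GenBank writer in recent Biopython releases.
record = SeqRecord(
    Seq("ACGTACGTACGT"),
    id="example1",
    description="record-level notes kept in a source feature",
    annotations={"molecule_type": "DNA"},
)

# A source feature spanning the whole sequence, carrying the
# user-defined information as a list of short strings.
source = SeqFeature(
    FeatureLocation(0, len(record.seq)),
    type="source",
    qualifiers={"note": ["sample=A1", "batch=7"]},
)
record.features.append(source)

# Write to an in-memory handle so the text can be inspected.
handle = StringIO()
SeqIO.write(record, handle, "genbank")
genbank_text = handle.getvalue()
print(genbank_text)

# Read it back to check that the qualifiers survive the round trip.
parsed = SeqIO.read(StringIO(genbank_text), "genbank")
print(parsed.features[0].qualifiers["note"])
```

The "note" key is used here because it is a standard INSDC qualifier; other
short keys without spaces should also be written, as described above.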
> What is the standard practice: a feature of type "source" that runs the
> entire length of the sequence? Or is it possible to have a SeqFeature with
> no position annotation? Ideally, if I slice the SeqFeature, I would like
> these annotations to stay with the slice no matter what.
If you did have a SeqFeature without a location, we couldn't write
it out in GenBank/EMBL format (the error handling here might be
improved).
If you have a SeqRecord with a (source) feature spanning the full
sequence, and you slice the SeqRecord to take a subsequence,
then that full length feature (and any other features not fully within
the subsequence) would be lost.
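The slicing behaviour described above can be seen in a small sketch like the
following. The record, the feature coordinates and the "misc_feature" entry
are invented for illustration only.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical 12 bp record with two features.
record = SeqRecord(Seq("ACGTACGTACGT"), id="example1")

# Full-length source feature holding the record-level information.
record.features.append(
    SeqFeature(
        FeatureLocation(0, 12),
        type="source",
        qualifiers={"note": ["sample=A1"]},
    )
)

# A short feature that lies entirely inside the slice taken below.
record.features.append(SeqFeature(FeatureLocation(2, 6), type="misc_feature"))

# Slicing keeps only features located fully within the subsequence,
# with their coordinates shifted to the new start.
sub_record = record[1:8]
kept_types = [feature.type for feature in sub_record.features]
print(kept_types)
print(sub_record.features[0].location)
```

Only the short feature survives; the source feature, and the notes stored in
it, are not carried over to the sliced record.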
Using a source feature is really just a workaround for the fact that
GenBank/EMBL do not support arbitrary record level annotation.
Do you have to use this as your output format? Would you not be
better off using a database or something else instead?
Peter