[Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?

Tue Mar 22 15:44:02 UTC 2011

>
> As far as the current Biopython output goes, you can basically use any
> (short) string as a qualifier key.
>

Sorry, I meant the values, not the keys.  Can a qualifier value be a list
of strings?

> Using a source feature is really just a workaround for the fact that
> GenBank/EMBL do not support arbitrary record level annotation.
> Do you have to use this as your output format?

Agreed.  Essentially, I have a huge pile of sequencing reads that are highly
annotated.  For any given read, there are some annotations that are
independent of the sequence itself (which is what I am trying to implement
now) and there are some annotations that are associated with subsequences
(which is why SeqFeatures are very appropriate).  Ideally, I want a file
format that stores the data, is easy (and fast) to parse, and is readable
with something like `less` (though this last feature is less important).

> Would you not be
> better off using a database or something else instead?
>

Well, initially I used XML to store the data, but I quickly realized I was
reinventing the wheel, especially when it came to annotating features on top
of the sequences.

Are you suggesting something like SQLite?  How would I deal with
SeqFeature-type annotations?
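
To make the question concrete, here is a minimal sketch of one way the two
kinds of annotation could be laid out in SQLite, using only Python's standard
`sqlite3` module.  The table names, column names, and helper functions are all
hypothetical (they are not part of Biopython), and coordinates are assumed to
be 0-based and half-open.  Multi-valued annotations are stored as one row per
value, which also covers the "list of strings as a value" case.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption
# made for illustration, not an existing Biopython or BioSQL layout.
SCHEMA = """
CREATE TABLE reads (
    read_id  TEXT PRIMARY KEY,
    sequence TEXT NOT NULL
);
-- Annotations that are independent of any position in the read.
CREATE TABLE read_annotations (
    read_id TEXT NOT NULL REFERENCES reads(read_id),
    key     TEXT NOT NULL,
    value   TEXT NOT NULL
);
-- Annotations tied to a subsequence (SeqFeature-like).
CREATE TABLE features (
    feature_id INTEGER PRIMARY KEY,
    read_id    TEXT NOT NULL REFERENCES reads(read_id),
    type       TEXT NOT NULL,
    start_pos  INTEGER NOT NULL,
    end_pos    INTEGER NOT NULL,
    strand     INTEGER
);
CREATE TABLE feature_qualifiers (
    feature_id INTEGER NOT NULL REFERENCES features(feature_id),
    key        TEXT NOT NULL,
    value      TEXT NOT NULL
);
"""


def open_store(path=":memory:"):
    """Open (or create) a database and set up the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_read(conn, read_id, sequence, annotations=None, features=None):
    """Store one read.

    annotations: dict mapping key -> list of strings
    features: list of (type, start, end, strand, qualifiers) tuples,
              where qualifiers is a dict mapping key -> list of strings
    """
    with conn:  # commit everything for this read in one transaction
        conn.execute("INSERT INTO reads VALUES (?, ?)", (read_id, sequence))
        for key, values in (annotations or {}).items():
            conn.executemany(
                "INSERT INTO read_annotations VALUES (?, ?, ?)",
                [(read_id, key, v) for v in values],
            )
        for ftype, start, end, strand, qualifiers in (features or []):
            cur = conn.execute(
                "INSERT INTO features (read_id, type, start_pos, end_pos, strand)"
                " VALUES (?, ?, ?, ?, ?)",
                (read_id, ftype, start, end, strand),
            )
            conn.executemany(
                "INSERT INTO feature_qualifiers VALUES (?, ?, ?)",
                [(cur.lastrowid, k, v) for k, vs in qualifiers.items() for v in vs],
            )


def read_annotations(conn, read_id):
    """Return the record-level annotations as a dict of key -> list of strings."""
    result = {}
    rows = conn.execute(
        "SELECT key, value FROM read_annotations WHERE read_id = ? ORDER BY rowid",
        (read_id,),
    )
    for key, value in rows:
        result.setdefault(key, []).append(value)
    return result


def features_overlapping(conn, read_id, start, end):
    """Return features on a read that overlap the half-open interval [start, end)."""
    return conn.execute(
        "SELECT type, start_pos, end_pos, strand FROM features"
        " WHERE read_id = ? AND start_pos < ? AND end_pos > ?"
        " ORDER BY start_pos",
        (read_id, end, start),
    ).fetchall()
```

With a layout like this, the record-level annotations and the feature
annotations live in separate tables, so neither has to be squeezed into a
source feature.  The trade-off is that the data is no longer readable with
`less`.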

Uri

>  Peter
>