[Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?

Tue Mar 22 12:58:03 EDT 2011

>
> For this the GenBank/EMBL format with the source feature trick
> does sound workable. You just need to be careful how and
> when you create the dummy source feature - I'd do it at the last
> moment before writing out the file, and in that way you can avoid
> things like slicing throwing it away.
>
>
That's a good idea.  This should be even easier since I am subclassing
SeqRecord.  I can override `format` to first take the whole annotations
dictionary and dump it into the qualifiers dictionary of a `source` feature.
I also have my own parser, which wraps SeqIO: after using SeqIO to parse the
'imgt' format, I can copy the `source` qualifiers to the annotations
dictionary and delete the `source` feature entirely.  Does this sound
reasonable?
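
A minimal sketch of the dictionary shuffling involved, using only plain
dicts so that it runs without Biopython.  The function names, the `user_`
prefix and the JSON encoding are all assumptions made for illustration;
none of them is part of Biopython or of the INSDC formats.

```python
import json

# Hypothetical prefix, used to tell user annotations apart from
# genuine `source` qualifiers such as "organism".
PREFIX = "user_"


def annotations_to_qualifiers(annotations):
    """Flatten an annotations dict into a qualifiers-style dict.

    Feature qualifiers are stored here as str -> list of str, so each
    value is JSON-encoded to keep ints, lists and nested dicts
    recoverable (an assumption, not a Biopython convention).
    """
    qualifiers = {}
    for key, value in annotations.items():
        qualifiers[PREFIX + key] = [json.dumps(value)]
    return qualifiers


def qualifiers_to_annotations(qualifiers):
    """Rebuild the annotations dict, ignoring unprefixed qualifiers."""
    annotations = {}
    for key, values in qualifiers.items():
        if key.startswith(PREFIX):
            annotations[key[len(PREFIX):]] = json.loads(values[0])
    return annotations


annotations = {"clone": "A1", "reads": 42, "tags": ["x", "y"]}
qualifiers = annotations_to_qualifiers(annotations)
restored = qualifiers_to_annotations(qualifiers)
```

In the subclass, the first dict would become the `qualifiers` of a
`SeqFeature` of type `source` spanning the whole sequence, created just
before writing; after parsing, the second function would be applied to that
feature's qualifiers before the feature is removed.  Whether JSON text
(quotes, long lines) survives the GenBank/EMBL writer and parser unchanged
is not checked here and would need testing.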

> I wonder if one of the INSDC XML formats would work nicely here?
> i.e. If they can be extended more easily. We should look at adding a
> parser for them to Biopython (and write support too ideally of course).
>

My only issue with this is that I'd rather not extend anyone's file format,
but use a standard file format that fits my purpose.  Otherwise, I might as
well just go straight for a database, as below.  (But there are some
super-fast XML parsers out there.)

> I was thinking you could use the BioSQL schema (run on SQLite if
> you wanted to, or MySQL or PostgreSQL etc). You'd still face the
> same issues if/when you wanted to dump the annotated records
> to a plain text file though.
>

I suppose plain text readability is less important to me than ease of
sharing the data.  But when I dump a SeqRecord object to a BioSQL database,
does it do it in a way that I can rebuild that object exactly with no loss
of information? (I.e., does it solve the annotation dictionary problem that
started this whole thread?)

Uri