[Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?

Uri Laserson laserson at mit.edu
Tue Mar 22 16:58:03 UTC 2011


>
> For this the GenBank/EMBL format with the source feature trick
> does sound workable. You just need to be careful how and
> when you create the dummy source feature - I'd do it at the last
> moment before writing out the file, and in that way you can avoid
> things like slicing throwing it away.
>
>
That's a good idea.  This should be even easier since I am subclassing
SeqRecord.  I can override `format` to first take the whole annotations
dictionary and dump it into the qualifiers dictionary of a `source` feature.
I also have my own parser which wraps SeqIO; after using SeqIO to parse the
'imgt' format, I can copy the `source` qualifiers to the annotations
dictionary and delete the `source` feature entirely.  Does this sound
reasonable?
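
Roughly, the two conversions I have in mind could look like the sketch
below.  The helper names are hypothetical (they are not part of Biopython),
and I am assuming annotation values are scalars or flat lists of scalars,
since feature qualifiers are stored as lists of strings.

```python
def annotations_to_qualifiers(annotations):
    """Flatten an annotations dict into feature-style qualifiers.

    Assumption: each value is a scalar or a flat list/tuple of scalars.
    Every value is stringified, because qualifiers hold lists of strings.
    """
    qualifiers = {}
    for key, value in annotations.items():
        if isinstance(value, (list, tuple)):
            qualifiers[key] = [str(item) for item in value]
        else:
            qualifiers[key] = [str(value)]
    return qualifiers


def qualifiers_to_annotations(qualifiers):
    """Rebuild an annotations dict from qualifiers.

    Single-item lists are unwrapped to a bare string; longer lists are
    copied as lists.  Original value types are not restored.
    """
    annotations = {}
    for key, values in qualifiers.items():
        annotations[key] = values[0] if len(values) == 1 else list(values)
    return annotations
```

The overridden `format` would put the result of the first helper into the
qualifiers of a `source` feature spanning the whole record just before
writing, and my parser would apply the second helper to that feature's
qualifiers before deleting it.  Note that this is not an exact round trip:
non-string values (integers, say) come back as strings, and a one-element
list comes back as a bare string.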


> I wonder if one of the INSDC XML formats would work nicely here?
> i.e. If they can be extended more easily. We should look at adding a
> parser for them to Biopython (and write support too ideally of course).
>

My only issue with this is that I'd rather not extend anyone's file format,
but use a standard file format that fits my purpose.  Otherwise, I might as
well just go straight for a database, as below.  (But there are some
super-fast XML parsers out there.)


> I was thinking you could use the BioSQL schema (run on SQLite if
> you wanted to, or MySQL or PostgreSQL etc). You'd still face the
> same issues if/when you wanted to dump the annotated records
> to a plain text file though.
>

I suppose plain text readability is less important to me than ease of
sharing the data.  But when I dump a SeqRecord object to a BioSQL database,
does it do it in a way that I can rebuild that object exactly with no loss
of information? (I.e., does it solve the annotation dictionary problem that
started this whole thread?)

Uri


