[Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?

Tue Mar 22 13:24:46 EDT 2011

On Tue, Mar 22, 2011 at 4:58 PM, Uri Laserson <laserson at mit.edu> wrote:
>> For this the GenBank/EMBL format with the source feature trick
>> does sound workable. You just need to be careful how and
>> when you create the dummy source feature - I'd do it at the last
>> moment before writing out the file, and in that way you can avoid
>> things like slicing throwing it away.
>
> That's a good idea.  This should be even easier since I am subclassing
> SeqRecord.  I can override `format` to first take the whole annotations
> dictionary and dump it into the qualifiers dictionary of a `source` feature.
>  I also have my own parser which wraps SeqIO; using SeqIO to parse the
> 'imgt' format, I can then copy the `source` qualifiers to the annotations
> dictionary and delete the `source` feature entirely.  Does this sound
> reasonable?

Yes, using your own parser/writer to take care of mapping between
the SeqRecord annotations dictionary and a dummy feature sounds
sensible. Also using 'imgt' rather than GenBank or EMBL will let you
have longer feature qualifier keys - but these files are not as widely
used/supported as the GenBank and EMBL formats.
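
As a rough illustration of that mapping, here is a minimal sketch. It is
deliberately duck-typed and uses `types.SimpleNamespace` stand-ins rather
than real Biopython objects, so the function names, the `make_feature`
factory and the example annotation keys are all hypothetical. With
Biopython you would pass a factory that builds a real SeqFeature spanning
the whole sequence.

```python
from types import SimpleNamespace


def annotations_to_source(record, make_feature):
    """Copy record.annotations into the qualifiers of a dummy 'source' feature.

    Intended to be called at the last moment before writing the file.
    Qualifier values are stored as lists of strings.
    """
    qualifiers = {}
    for key, value in record.annotations.items():
        values = value if isinstance(value, list) else [value]
        qualifiers[key] = [str(item) for item in values]
    # make_feature is a hypothetical factory: (type, qualifiers) -> feature
    record.features.insert(0, make_feature("source", qualifiers))
    return record


def source_to_annotations(record):
    """Move 'source' qualifiers back into record.annotations after parsing.

    Every 'source' feature is removed, so this assumes the file holds no
    genuine source feature that you want to keep.
    """
    remaining = []
    for feature in record.features:
        if feature.type == "source":
            for key, values in feature.qualifiers.items():
                # Unwrap single-entry lists back to a plain string.
                record.annotations[key] = values[0] if len(values) == 1 else list(values)
        else:
            remaining.append(feature)
    record.features = remaining
    return record


# Stand-in objects for illustration only (not Biopython classes).
def make_feature(ftype, quals):
    return SimpleNamespace(type=ftype, qualifiers=quals)


record = SimpleNamespace(annotations={"clone": "A7", "reads": 12}, features=[])
annotations_to_source(record, make_feature)

restored = SimpleNamespace(annotations={}, features=list(record.features))
source_to_annotations(restored)
```

Note that the integer 12 comes back as the string "12": a text file only
stores strings, so any type conversion has to be done by your own parser.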

>> I wonder if one of the INSDC XML formats would work nicely here?
>> i.e. If they can be extended more easily. We should look at adding a
>> parser for them to Biopython (and write support too ideally of course).
>
> My only issue with this is that I'd rather not extend anyone's file format,
> but use a standard file format that fits my purpose.  Otherwise, I might as
> well just go straight for a database, as below.  (But there are some
> super-fast XML parsers out there.)

I haven't looked at the details to see if those XML file formats have
a nice open ended misc annotation tag you could just use.

>> I was thinking you could use the BioSQL schema (run on SQLite if
>> you wanted to, or MySQL or PostgreSQL etc). You'd still face the
>> same issues if/when you wanted to dump the annotated records
>> to a plain text file though.
>
> I suppose plain text readability is less important to me than ease of
> sharing the data.  But when I dump a SeqRecord object to a BioSQL
> database, does it do it in a way that I can rebuild that object exactly
> with no loss of information? (I.e., does it solve the annotation dictionary
> problem that started this whole thread?)

Basically yes, subject to a few provisos, it should. Firstly note we
don't support any per-letter-annotation in BioSQL. Secondly, all
the SeqRecord annotations and SeqFeature qualifiers will end up being
stored as strings (in table bioentry_qualifier_value and table
seqfeature_qualifier_value respectively). There may also be some
fun with string values vs single entry lists containing one string.
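
If you want to check a round trip despite those two provisos, one option
is to normalise values before comparing. This is only a sketch in plain
Python: the helper names and the example dictionaries are made up, and
I have not checked exactly what shape BioSQL hands back for each key.

```python
def normalise(value):
    """Return value as a list of strings, so 'x' and ['x'] compare equal."""
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


def same_annotations(before, after):
    """Compare two annotation dictionaries, ignoring str/list/int differences."""
    if before.keys() != after.keys():
        return False
    return all(normalise(before[key]) == normalise(after[key]) for key in before)


# Hypothetical annotations before storing and after reloading.
original = {"clone": "A7", "reads": 12, "keywords": ["IGH", "VDJ"]}
reloaded = {"clone": ["A7"], "reads": "12", "keywords": ["IGH", "VDJ"]}
changed = {"clone": ["A8"], "reads": "12", "keywords": ["IGH", "VDJ"]}

round_trip_ok = same_annotations(original, reloaded)
detects_change = not same_annotations(original, changed)
```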

Peter