[Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?

Peter Cock p.j.a.cock at googlemail.com
Tue Mar 22 09:22:17 UTC 2011


On Mon, Mar 21, 2011 at 11:38 PM, Uri Laserson <laserson at mit.edu> wrote:
> If I load a GenBank-formatted record:
>
>    a = SeqIO.parse('myfile.gb','gb').next()
>
> then set an annotation:
>
>    a.annotations['myannotation'] = 'saveme'
>
> and then format the SeqRecord object as GenBank:
>
>    a.format('gb')
>
> then 'myannotation' is lost.

It isn't 'lost' in that it is still in your SeqRecord object in
memory, but it isn't in the GenBank format output.
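
To see the difference, here is a minimal sketch. It assumes a recent Biopython (1.78 or later, where the GenBank writer needs a molecule_type annotation) and uses a made-up toy record in place of 'myfile.gb':

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record standing in for one parsed from a GenBank file (hypothetical data).
record = SeqRecord(
    Seq("ACGTACGTACGT"),
    id="TEST0001",
    name="TEST0001",
    description="toy record",
)
record.annotations["molecule_type"] = "DNA"  # needed by recent GenBank writers
record.annotations["myannotation"] = "saveme"

genbank_text = record.format("gb")

# The custom annotation is still on the in-memory object...
still_in_memory = record.annotations["myannotation"] == "saveme"
# ...but the GenBank writer has no header field to put it in.
in_output = "saveme" in genbank_text
print(still_in_memory, in_output)
```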

> Is this expected behavior?

Yes, there is no general field for record level annotation in the
GenBank or EMBL file formats. Where did you expect it to be
written? The same thing would happen with most file formats,
e.g. FASTA has no annotation support at all beyond the free
text description line.

> If so, that's a huge bummer...what is the suggested method to
> store my own annotations in INSDC formats?

You could stuff record-level information into a source feature's
qualifier dictionary. It isn't elegant, but it would work. The NCBI
seems to have introduced the source feature primarily to store
the taxon identifier and other little bits of information not
handled explicitly in the header lines. (Plus it can handle
chimeras, which may have been a use case.)
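
Something along these lines might do as a rough sketch of that workaround. The helper name stash_in_source is made up for illustration, the toy record is hypothetical, and packing the value into a standard /note qualifier as "key=value" is just one possible convention, not anything Biopython prescribes. It again assumes Biopython 1.78 or later:

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def stash_in_source(record, key, value):
    """Store a value as a qualifier on the record's source feature.

    Hypothetical helper: creates a whole-sequence source feature if the
    record does not already have one.
    """
    for feature in record.features:
        if feature.type == "source":
            break
    else:
        feature = SeqFeature(FeatureLocation(0, len(record.seq)), type="source")
        record.features.insert(0, feature)
    # Append so any existing values for this qualifier are kept.
    feature.qualifiers.setdefault(key, []).append(value)
    return feature


record = SeqRecord(
    Seq("ACGTACGTACGT"),
    id="TEST0001",
    name="TEST0001",
    description="toy record",
)
record.annotations["molecule_type"] = "DNA"  # needed by recent GenBank writers

# Pack the user annotation into the standard /note qualifier.
stash_in_source(record, "note", "myannotation=saveme")

genbank_text = record.format("gb")

# Read the text back to check the qualifier survives a round trip.
round_trip = SeqIO.read(StringIO(genbank_text), "gb")
source = round_trip.features[0]
print(source.qualifiers["note"])
```

Using /note keeps the output closer to what other INSDC tools expect; a custom qualifier key would probably be written too, but other software may reject it.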

Peter



