[Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?

Peter Cock p.j.a.cock at googlemail.com
Tue Mar 22 12:14:05 EDT 2011


On Tue, Mar 22, 2011 at 3:44 PM, Uri Laserson <laserson at mit.edu> wrote:
>> As far as the current Biopython output goes, you can basically use any
>> (short) string as a qualifier key.
>
> Sorry, I meant for the values, not the keys.  Can you have a list of strings
> as a value?

Right. Again yes, plus I think a single string as the value should work.
This is because the INSDC feature table allows multiple values for a
tag - for example you often get multiple database cross references.
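
As a rough sketch of the above (assuming a recent Biopython release, where
the "molecule_type" annotation is needed for GenBank output; the record ID,
sequence and qualifier values are made up for illustration), a qualifier
value can be given either as a list of strings or as a single string:

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up record purely for illustration.
record = SeqRecord(
    Seq("ACGTACGTACGT"), id="read001", name="read001", description="example read"
)
# Assumption: recent Biopython releases need this for GenBank/EMBL output.
record.annotations["molecule_type"] = "DNA"

feature = SeqFeature(
    FeatureLocation(0, len(record.seq)),
    type="source",
    qualifiers={
        # A list of strings gives one qualifier line per value.
        "db_xref": ["taxon:9606", "HypotheticalDB:42"],
        # A single string as the value.
        "note": "single string value",
    },
)
record.features.append(feature)

handle = StringIO()
SeqIO.write(record, handle, "genbank")
genbank_text = handle.getvalue()

# Parse the text back in to see what survived the round trip.
parsed = SeqIO.read(StringIO(genbank_text), "genbank")
parsed_qualifiers = parsed.features[0].qualifiers
```

On reading the file back in, the parser returns every qualifier value as a
list of strings, including the one that was written from a single string.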

>> Using a source feature is really just a work around for the fact that
>> GenBank/EMBL do not support arbitrary record level annotation.
>> Do you have to use this as your output format?
>
> Agreed.  Essentially, I have a huge pile of sequencing reads that are highly
> annotated.  For any given read, there are some annotations that are
> independent of the sequence itself (which is what I am trying to implement
> now) and there are some annotations that are associated with subsequences
> (which is why SeqFeatures are very appropriate).  Ideally, I want a file
> format that will store the data, be easily parsable (and fast), and can be
> readable using something like `less` (though this last feature is less
> important).

For this the GenBank/EMBL format with the source feature trick
does sound workable. You just need to be careful how and when
you create the dummy source feature - I'd do it at the last
moment before writing out the file, and in that way you can avoid
things like slicing throwing it away.
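
A minimal sketch of that "last moment" approach follows. The helper name
write_with_source_feature, the qualifier keys "sample" and "primer_set", and
the default molecule type are all hypothetical, and this assumes a recent
Biopython release. The record-level annotation is kept in a plain dict and
only turned into a source feature on a copy of the record at write time:

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def write_with_source_feature(record, record_level, handle, molecule_type="DNA"):
    """Hypothetical helper: write record as GenBank with a dummy source feature.

    record_level maps qualifier keys to lists of strings. The source feature
    is added to a copy, so the original record is left untouched.
    """
    source = SeqFeature(
        FeatureLocation(0, len(record.seq)),
        type="source",
        qualifiers={key: list(values) for key, values in record_level.items()},
    )
    annotations = dict(record.annotations)
    # Assumption: fall back to a default molecule type if none is set.
    annotations.setdefault("molecule_type", molecule_type)
    out = SeqRecord(
        record.seq,
        id=record.id,
        name=record.name,
        description=record.description,
        annotations=annotations,
        features=[source] + list(record.features),
    )
    return SeqIO.write(out, handle, "genbank")


# Made-up read with one sub-sequence feature.
read = SeqRecord(
    Seq("ACGTACGTACGT"), id="read002", name="read002", description="example read"
)
read.annotations["molecule_type"] = "DNA"
read.features.append(
    SeqFeature(
        FeatureLocation(3, 9),
        type="misc_feature",
        qualifiers={"note": ["hypothetical region"]},
    )
)

# Record-level annotation kept outside the feature table until output.
record_level = {"sample": ["donor_A"], "primer_set": ["set1", "set2"]}

# Slicing keeps only features lying fully inside the slice, so a source
# feature added earlier would have been dropped here.
trimmed = read[2:11]

handle = StringIO()
write_with_source_feature(trimmed, record_level, handle)
round_trip = SeqIO.read(StringIO(handle.getvalue()), "genbank")
source_qualifiers = round_trip.features[0].qualifiers
```

Because the source feature is rebuilt to span whatever record is being
written, it always covers the full (possibly trimmed) sequence.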

>> Would you not be
>> better off with using a database or something else instead?
>
> Well, initially I used XML to store the data, but I quickly realized I was
> reinventing the wheel, especially when it came to annotating features
> on top of the sequences.

I wonder if one of the INSDC XML formats would work nicely here,
i.e. if they can be extended more easily? We should look at adding a
parser for them to Biopython (and ideally write support too, of course).

> Are you suggesting something like SQLite?  How would I deal with
> SeqFeature-type annotations?

I was thinking you could use the BioSQL schema (run on SQLite if
you wanted to, or MySQL or PostgreSQL etc). You'd still face the
same issues if/when you wanted to dump the annotated records
to a plain text file though.

Peter


