[Biopython-dev] fpc and gff

Mon Sep 28 13:10:22 UTC 2009

On Mon, Sep 28, 2009 at 1:52 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
>> Michiel and I had suggested a simpler structure more faithful to the
>> GFF model might be useful (even if it was just a standardised tuple
>> of the start, end, strand, id, etc, and an annotation dictionary). For
>> the SeqIO interface, these GFF features would have to be turned
>> into normal SeqFeature objects of course.
>
> This could also be useful for a more lightweight representation. I
> would rather see this type of representation with primary Python
> types, as opposed to a GFFFeature specific class. The current
> SeqRecord/SeqFeature implementation is relatively close to what
> a GFF specific class would be so there would be a lot of duplication
> without saving much in terms of speed or memory.

Indeed. Which is why I quite like the idea of a simple tuple of ints,
strings and a dict for the annotation (the final column of a GFF file).
This should also be fast for people dealing with big GFF files.
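
As a rough sketch of what I mean (this is only an illustration: the
function name parse_gff_line, the field order, and the choice of GFF3
style key=value attributes are all assumptions, not an agreed API, and
it is written for Python 3):

```python
from urllib.parse import unquote


def parse_gff_line(line):
    """Hypothetical helper: turn one GFF3 feature line into a plain tuple.

    Returns (seqid, source, type, start, end, score, strand, phase,
    annotation) where start/end are ints, annotation is a dict mapping
    each attribute key to a list of string values, and the remaining
    fields are left as the strings found in the file.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("Expected 9 tab separated columns, got %i" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attributes = cols

    # The final column holds the annotation, assumed here to be GFF3
    # style "key=value1,value2;key2=value" with URL style escaping.
    annotation = {}
    if attributes != ".":
        for pair in attributes.split(";"):
            if not pair:
                continue  # tolerate a trailing semicolon
            key, _, value = pair.partition("=")
            annotation[unquote(key)] = [unquote(v) for v in value.split(",")]

    return (seqid, source, ftype, int(start), int(end),
            score, strand, phase, annotation)


example = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=abc"
feature = parse_gff_line(example)
```

Nothing here needs a GFF specific class, so a caller can unpack the
tuple directly or look things up in the annotation dict.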

The other plus point here is we can get this (GFF parsing/writing
using basic Python objects) into Biopython first, and then look at
the SeqIO side of things more carefully as a second merge. I may
be overly cautious but I want the resulting GFF <-> SeqRecord <->
GenBank/EMBL/etc mapping to try and follow established practice
as closely as possible, which will need lots of testing and probably
some tweaking of this mapping.

i.e. To me there is a natural break between the basics of GFF
parsing/writing, and the transformation into our existing object
models.

[This applies to all file formats in principle, but most are so simple
that it isn't really an issue worth worrying about.]

Peter