[Biopython-dev] fpc and gff

Brad Chapman chapmanb at 50mail.com
Mon Sep 28 08:52:38 EDT 2009


Jose;
Glad you're interested in working on this. I'm happy to get the GFF3
writing up to speed for this task.

> > I'm interested in parsing an fpc physical map and writing a gff3 file from it.
[...]
> > But I don't have a clear idea of the relation between GFF and SeqFeature. The
> > main problem is the subfeature and the gff feature hierarchy. My take on that
> > at the moment is to write a GFFFeature class similar to the gff feature with
> > seqid, source, type, start, end, score, etc. and go from the fpc to
> > GFFFeature objects.

> Right now there isn't a "proper way" as Brad's GFF code hasn't
> been integrated into Biopython yet.

Yes, we still have some flexibility here since it hasn't been merged
into Biopython yet, so let's talk about what works best.

> I think Brad was thinking of using the SeqFeature object "as is" to hold
> GFF features, with the sub-features list used for the hierarchy.

What exists now takes an iterator of SeqRecord objects, and writes
each SeqFeature as a GFF3 line:

seqid -- SeqRecord ID
source -- Feature qualifier with key "source"
type -- Feature type attribute
start, end -- The Feature Location
score -- Feature qualifier with key "score"
strand -- Feature strand attribute
phase -- Feature qualifier with key "phase"

The remaining qualifiers become the key/value pairs of the final
attributes column.
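
To make the column mapping above concrete, here is a rough sketch in
plain Python. It is not the actual writer code and does not use
Biopython; gff3_line and its arguments are made-up names, the
qualifiers dict just mimics SeqFeature qualifiers (key -> list of
values), and I'm assuming the start coordinate is 0-based as in
Biopython locations, so it is shifted by one for GFF3.

```python
from urllib.parse import quote

# Assumed mapping from a Biopython-style strand value to the GFF3 column.
_STRAND = {1: "+", -1: "-", 0: ".", None: "."}


def _first(qualifiers, key, default="."):
    """Return the first value stored under key, or "." when it is missing."""
    value = qualifiers.get(key, default)
    if isinstance(value, (list, tuple)):
        value = value[0] if value else default
    return str(value)


def gff3_line(seqid, ftype, start, end, strand, qualifiers):
    """Build one tab-separated GFF3 line (hypothetical helper)."""
    reserved = ("source", "score", "phase")
    attrs = []
    for key, value in qualifiers.items():
        if key in reserved:
            continue  # these qualifiers fill their own columns
        values = value if isinstance(value, (list, tuple)) else [value]
        # Percent-encode characters such as ";", "=", "," and "&".
        escaped = ",".join(quote(str(v), safe=" :/") for v in values)
        attrs.append("%s=%s" % (key, escaped))
    columns = [
        seqid,
        _first(qualifiers, "source"),
        ftype,
        str(start + 1),  # assumed 0-based start -> 1-based GFF3 start
        str(end),
        _first(qualifiers, "score"),
        _STRAND[strand],
        _first(qualifiers, "phase"),
        ";".join(attrs) or ".",
    ]
    return "\t".join(columns)
```

Calling gff3_line("ctg1", "BAC", 0, 1000, 1, {"source": ["fpc"],
"ID": ["clone1"]}) gives a line with "fpc" in the source column and
"ID=clone1" as the only attribute.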

The hierarchy is represented as sub_features of the parent feature.
This handles any arbitrarily deep nesting of parent and child 
features.
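
As an illustration of how nested sub_features turn into GFF3
parent/child links, here is a small self-contained sketch. The Feature
class is only a stand-in for SeqFeature so the example runs without
Biopython, and flatten is a made-up helper, not part of the current
code. The contig/BAC/marker names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Feature:
    """Minimal stand-in for a feature that can hold child features."""
    id: str
    type: str
    sub_features: List["Feature"] = field(default_factory=list)


def flatten(feature: Feature,
            parent_id: Optional[str] = None
            ) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (id, type, parent id) for a feature and all its descendants.

    The parent id is what would end up in the GFF3 Parent attribute;
    top-level features get None.
    """
    yield feature.id, feature.type, parent_id
    for child in feature.sub_features:
        # Recurse so any depth of nesting is handled the same way.
        yield from flatten(child, feature.id)


# Invented three-level example: contig -> clone -> marker.
marker = Feature("mrk1", "marker")
clone = Feature("clone1", "BAC", [marker])
contig = Feature("ctg1", "contig", [clone])
rows = list(flatten(contig))
```

Each row carries the id of its parent, so a writer only needs to add
Parent=<parent id> to the attributes of every non-top-level line.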

There is some really basic code on the documentation page:

http://biopython.org/wiki/GFF_Parsing#Writing_GFF3

> Michiel and I had suggested a simpler structure more faithful to the
> GFF model might be useful (even if it was just a standardised tuple
> of the start, end, strand, id, etc, and an annotation dictionary). For
> the SeqIO interface, these GFF features would have to be turned
> into normal SeqFeature objects of course.

This could also be useful for a more lightweight representation. I
would rather see this type of representation with primary Python
types, as opposed to a GFFFeature-specific class. The current
SeqRecord/SeqFeature implementation is relatively close to what
a GFF-specific class would be, so there would be a lot of duplication
without saving much in terms of speed or memory.
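
For comparison, a lightweight version with only built-in types might
look something like the sketch below. The tuple layout (the nine GFF3
columns, with a dict for the attributes) and the group_by_parent
helper are just one possible arrangement, not an agreed format, and
the feature values are invented for illustration.

```python
from collections import defaultdict

# One tuple per feature:
# (seqid, source, type, start, end, score, strand, phase, attributes)
features = [
    ("ctg1", "fpc", "contig", 1, 5000, None, None, None,
     {"ID": "ctg1"}),
    ("ctg1", "fpc", "BAC", 1, 1200, None, None, None,
     {"ID": "cloneA", "Parent": "ctg1"}),
    ("ctg1", "fpc", "BAC", 900, 2500, None, None, None,
     {"ID": "cloneB", "Parent": "ctg1"}),
]


def group_by_parent(features):
    """Map each parent ID (None for top level) to the IDs of its children."""
    children = defaultdict(list)
    for feat in features:
        attributes = feat[-1]  # the attributes dict is the last element
        children[attributes.get("Parent")].append(attributes["ID"])
    return dict(children)


hierarchy = group_by_parent(features)
```

Here the hierarchy lives entirely in the ID and Parent attributes
rather than in nested objects, which is closer to how the GFF3 file
itself stores it.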

Jose, let me know if you'd rather go with a SeqRecord approach or a
lightweight approach. If you provide a couple of examples of the
features you want to store, we can work through how to best
represent those in the GFF hierarchy and then the details of
prepping them for writing.

Brad

