[Biopython-dev] Bio.GFF and Brad's code

Brad Chapman chapmanb at 50mail.com
Tue Dec 8 13:33:12 UTC 2009


Peter and Michiel;
Thanks for the thoughts. I've tried to combine them below:

Michiel:
> I didn't realize that the GFF parser returns SeqRecords. I agree with
> Peter that a parser returning SeqRecords should be accessed through
> Bio.SeqIO, while a lower-level parser can live in Bio.GFF.

Peter:
> My point is the moment you include GFF -> SeqRecord
> code (even if not explicitly via the Bio.SeqIO namespace)
> it opens us up to people giving these SeqRecord objects
> to SeqIO for output (e.g. as GenBank).
[...]
> Worthy goals, but if by "Produce Biopython objects from
> GFF3/GTF/GFF2 files" you mean SeqRecords with
> SeqFeatures, (as I said above) we are opening up the
> GFF to GenBank can of worms. There is no "later" :(

We seem to have a very different view of SeqRecords/SeqFeatures. To
me, they are a convenient, well-thought-out object model for capturing
annotations and features associated with a sequence. They have the
advantage that people who have used Biopython will already be familiar
with them. That's why I chose to use them for representing GFF, as
opposed to a GFF-specific class.

You are adding on two extra conditions:

- If something produces SeqRecords, it needs to come from SeqIO.
- If you have a SeqRecord, it has to be compatible with GenBank
  output.

This quickly ties us to the not-that-great GenBank way of
representing features and locations, and makes it hard to add more
flexible formats like GFF. Converting between very different feature
representations is going to be complex and is a whole new problem;
why should you have to support that just to use a SeqRecord in your code?

Overall, I'd like to see it be simpler for people to contribute and
add parsers to Biopython.

> I still think it would be useful to have Bio/GFF/Parser.py (or
> similar) as the low level parser, and Bio/SeqIO/GffIO.py (or
> similar) to turn this into SeqRecord and SeqFeature objects.

This appears to be about where the code lives. Personally, I prefer
having things under the GFF namespace and then building thin
wrappers around it in SeqIO if desired. Practically, I want to leave
SeqIO inclusion out for now and argue only for getting the
GFF-specific parser in.
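
To make the layering concrete, here is a rough sketch of a low-level
parser with a thin wrapper on top. It is not the actual parser code:
parse_gff_line, parse_gff, Record and to_records are hypothetical names
made up for illustration, and Record is only a stand-in for a
SeqRecord-like container. It assumes tab-separated GFF3 feature lines
with nine columns.

```python
# Hypothetical sketch: none of these names are Biopython APIs.
from dataclasses import dataclass, field
from urllib.parse import unquote


def parse_gff_line(line):
    """Low-level step: split one GFF3 feature line into a plain dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 allows comma-separated, URL-escaped values per key.
            attributes[key.strip()] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start) - 1,  # GFF is 1-based inclusive; store a 0-based start
        "end": int(end),
        "strand": {"+": 1, "-": -1}.get(strand),  # "." or "?" becomes None
        "attributes": attributes,
    }


def parse_gff(lines):
    """Low-level parser: yield one dict per feature line, skipping comments."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_gff_line(line)


@dataclass
class Record:
    """Stand-in for a SeqRecord-like container (hypothetical)."""
    id: str
    features: list = field(default_factory=list)


def to_records(lines):
    """Thin wrapper: group the low-level feature dicts by sequence ID."""
    records = {}
    for feat in parse_gff(lines):
        records.setdefault(feat["seqid"], Record(feat["seqid"])).features.append(feat)
    return list(records.values())
```

The wrapper only groups what the low-level parser yields, so it could
sit in either namespace without the parser needing to know about it.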

> The nested features that worry me. Perhaps the existing
> location operator (e.g. "join") could be set to something
> like "parent/child" if the subfeatures is used to hold child
> features rather than the elements of a join? We need
> the GenBank output code etc to be able to tell these
> apart reliably.

Right now I don't set the location operator at all. The parent/child
model is much more flexible than the GenBank operator stuff, so
maybe the right way to go is to phase out using the operator altogether.
If it is set to nothing, then parent/child is assumed, and GenBank
output can add in all of the operators at output time.
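
As a rough illustration of adding operators only at output time, the
sketch below stores nothing but parent/child nesting and derives a
GenBank-style location string when asked. The Feature class and the
genbank_location function are hypothetical, not Biopython's SeqFeature
API, and the choice to join the "exon" children is an assumption made
for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a feature holding nested child features."""
    type: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    strand: int = 1
    children: list = field(default_factory=list)


def genbank_location(feature, part_type="exon"):
    """Build a GenBank-style location string; operators are added only here."""
    # Pick the children that make up the parts of the location, in order.
    parts = sorted(
        (c for c in feature.children if c.type == part_type),
        key=lambda c: c.start,
    )
    spans = ["%d..%d" % (p.start + 1, p.end) for p in parts]
    if not spans:
        # No matching children: fall back to the feature's own span.
        spans = ["%d..%d" % (feature.start + 1, feature.end)]
    loc = spans[0] if len(spans) == 1 else "join(%s)" % ",".join(spans)
    return "complement(%s)" % loc if feature.strand == -1 else loc


mrna = Feature("mRNA", 99, 500, children=[
    Feature("exon", 299, 500),
    Feature("exon", 99, 200),
])
print(genbank_location(mrna))  # join(100..200,300..500)
```

No operator is stored on the features themselves; the "join" appears
only because the output code found more than one matching child.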

Brad


