[Biopython-dev] Bio.GFF and Brad's code

Tue Dec 8 14:15:30 UTC 2009

On Tue, Dec 8, 2009 at 1:33 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> We seem to have a very different view of SeqRecords/SeqFeatures. To
> me, they are a convenient well thought out object model to capture
> annotations and features associated with a sequence. They have the
> advantage that people who have used Biopython will be familiar with
> the object model. That's why I chose to use them for representing GFF,
> as opposed to a GFF specific class.

OK, but (as I expand on below), your planned use of the SeqFeature
(while legitimate) appears to risk being inconsistent with existing parts
of the Biopython code base (in particular, GenBank output, and maybe
GenomeDiagram).

> You are adding on two extra conditions:
>
> - If something produces SeqRecords, it needs to come from SeqIO.

It was more of an aim than a rule. It isn't true of all the existing
code, for historical reasons, e.g. Bio.SeqIO "genbank" support acts as
a thin wrapper around Bio.GenBank, which itself offers SeqRecord
objects. From a user perspective, if you want a SeqRecord from a
sequence file, the first port of call should be Bio.SeqIO.

> - If you have a SeqRecord, it has to be compatible with GenBank
>  output.
>
> This quickly ties us up to the not-that-great GenBank way of
> representing features and locations, and makes it hard to add on more
> flexible formats like GFF. Converting between very different feature
> representations is going to be complex and a whole new problem;
> why do you have to support that to use a SeqRecord in your code?

The big aim of Bio.SeqIO was to allow using many different file
formats with the same object representation. Implicitly (assuming
the required data is present), input from one file format could be
output in another format. The problem is that lots of current code in
Biopython uses SeqRecord/SeqFeatures in a particular way
(GenBank/EMBL parsers, GenomeDiagram, GenBank output).
Unfortunately, for GFF files it seems this isn't the most natural
way to use SeqFeature objects (where you need real nesting).

> Overall, I'd like to see it be simpler for people to contribute and
> add parsers to Biopython.

I hope that for simple file formats this is already the case. But for
annotation rich file formats, if we want SeqIO to continue to be
useful for conversion, this by necessity requires some
awareness of how the other parsers/writers will represent
the same data.

One option for contributions is to offer a "low level" parser
using basic Python datatypes or simple file-type specific
records. Then someone more familiar with SeqIO and the
other file formats can write a SeqRecord converter in order
to integrate it into Bio.SeqIO.  This is basically how Ace,
Phred, SwissProt (and probably others) were done.
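
As a rough sketch of that split for GFF3 (the function names and
dict keys here are made up for illustration, not existing Biopython
API, and plain dicts stand in for real SeqFeature objects), the low
level parser only needs basic Python datatypes, and the converter
is a separate step:

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a plain dict (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if not pair or pair == ".":
            continue
        key, _, value = pair.partition("=")
        # GFF3 allows comma-separated multiple values per key
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive, as in the file
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }


def to_feature_dict(row):
    """Stand-in for the SeqRecord converter step (assumed layout).

    Shifts to Python-style 0-based, end-exclusive coordinates and
    maps the strand to +1, -1 or None.
    """
    strand = {"+": 1, "-": -1}.get(row["strand"])
    return {
        "type": row["type"],
        "start": row["start"] - 1,
        "end": row["end"],
        "strand": strand,
        "qualifiers": row["attributes"],
    }
```

The second function is where knowledge of how the other SeqIO
parsers represent the same data would have to live.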

>> I still think it would be useful to have Bio/GFF/Parser.py (or
>> similar) as the low level parser, and Bio/SeqIO/GffIO.py (or
>> similar) to turn this into SeqRecord and SeqFeature objects.
>
> This appears to be about where the code lives. Personally, I prefer
> having things under the GFF namespace and then building thin
> wrappers around it in SeqIO if desired. Practically, I want to leave
> SeqIO inclusion out right now and try to argue only for getting the
> GFF specific parser in.

Where the code lives isn't a big issue. You can do a thin
wrapper in Bio.SeqIO calling Bio.GFF (where Bio.GFF makes
SeqRecords), or a fat wrapper (where Bio.GFF does not make
SeqRecords).

The problem (as I see it) is SeqIO integration and how your
desired use of SeqFeatures will impact this.

>> It is the nested features that worry me. Perhaps the existing
>> location operator (e.g. "join") could be set to something
>> like "parent/child" if the subfeatures is used to hold child
>> features rather than the elements of a join? We need
>> the GenBank output code etc to be able to tell these
>> apart reliably.
>
> Right now I don't set the location operator at all. The parent/child
> model is much more flexible than the GenBank operator stuff, so
> maybe the right way to go is to phase out using the operator at all.
> If it is set to nothing then parent/child is assumed, and GenBank
> output can add in all of the operators at output time.

I agree that using SeqFeature sub-features for parent/child
relationships makes a lot of sense. BUT, we have a lot of
existing code which follows the GenBank/EMBL parser
route of using this for joins (and a few other corner cases).
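
To make the ambiguity concrete, here is a minimal sketch. The
Feature class is a hypothetical, much-reduced stand-in for
SeqFeature, and the rule in sub_feature_role (no operator means
children) is just the convention you describe above, taken as an
assumption rather than anything the current code does:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """Hypothetical, much-reduced stand-in for SeqFeature."""
    type: str
    start: int
    end: int
    location_operator: Optional[str] = None
    sub_features: List["Feature"] = field(default_factory=list)


def sub_feature_role(feature):
    """Say how a writer should interpret feature.sub_features."""
    if not feature.sub_features:
        return "simple"
    if feature.location_operator in ("join", "order"):
        # GenBank/EMBL style: sub-features are pieces of one location
        return "location parts"
    # Assumed convention: no operator set means real parent/child nesting
    return "children"
```

Any writer (GenBank output, GenomeDiagram) would need a check
along these lines before walking the sub-features.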

There are other annoyances with the current SeqFeature
and FeatureLocation model - the strand and location operator
are part of the SeqFeature not the FeatureLocation. It would
make more sense to me to move them to the FeatureLocation
(and have that handle joins itself). Or, move everything to
the SeqFeature (and get rid of the FeatureLocation object).
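
A sketch of the first option, where the location owns the strand
and the operator and handles joins itself. The Span and Location
names are hypothetical, purely to illustrate the idea, and are not
a proposal for the final API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Span:
    start: int  # 0-based, inclusive
    end: int    # end-exclusive


@dataclass(frozen=True)
class Location:
    """Hypothetical location that owns its strand and operator."""
    parts: Tuple[Span, ...]
    strand: Optional[int] = None    # +1, -1 or None
    operator: Optional[str] = None  # e.g. "join" when there are several parts

    @property
    def start(self):
        # Leftmost coordinate over all parts
        return min(p.start for p in self.parts)

    @property
    def end(self):
        # Rightmost coordinate over all parts
        return max(p.end for p in self.parts)

    def __len__(self):
        # Total length covered by the parts, excluding the gaps between them
        return sum(p.end - p.start for p in self.parts)
```

With something like this, sub-features would no longer be needed
to describe a join, which would free them up for parent/child use.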

I think the best route forward is to plan a transition of the
SeqFeature object to allow nice handling of real nested
relationships, and a reworking of complex location handling.
Then (hopefully) we can have the GenBank/EMBL/GFF3
parsers all using the SeqFeature in a consistent way.

Peter