[Biopython-dev] Bio.GFF and Brad's code

Tue Apr 14 10:36:03 UTC 2009

--- On Mon, 4/13/09, Brad Chapman <chapmanb at 50mail.com> wrote:
> A normal use case would be:
> 
> - Use SeqIO to parse a FASTA file with the sequences =>
> SeqRecords
> - Use the GFFParser to add features from a separate GFF
> file to the  SeqRecords. These are SeqFeatures, added to
> the right records and nested in a parent/child relationship
>  as appropriate.

Usually, when I use a GFF file, I either don't have an associated FASTA file or am not particularly interested in the original sequences. So while this approach is useful for some people, in its current form it does not cover the general case.

First, let's discuss how to represent the information contained in a GFF file. SeqRecords are a good fit if the GFF file is associated with a FASTA file (or contains the sequence itself), but otherwise they seem a bit awkward. How about the following (I think Peter was hinting at the same idea):

The actual parser lives in Bio.GFF and produces Bio.GFF.Record objects that closely resemble the GFF file structure. For example, we use the GFF-specified fields (<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]) as attributes of Bio.GFF.Record objects.
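
To make the idea concrete, here is a rough sketch of what such a record and parser could look like. Nothing here is existing Biopython API: the class name GFFRecord and the function parse_gff are made up for illustration, "." is assumed to mean "no value", and the handling of trailing comments is deliberately simplified (a "#" inside a quoted attribute value would be split incorrectly).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class GFFRecord:
    """Hypothetical stand-in for the proposed Bio.GFF.Record.

    One attribute per GFF column, in file order.
    """
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: Optional[str]
    frame: Optional[int]
    attributes: str = ""
    comments: str = ""


def parse_gff(lines: Iterable[str]) -> Iterator[GFFRecord]:
    """Yield one GFFRecord per data line (hypothetical helper)."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and whole-line comments / "##" directives.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError("expected at least 8 tab-separated fields: %r" % line)
        # Optional ninth column: attributes, possibly followed by "# comment".
        # Assumption: the first "#" starts the comment (simplification).
        trailing = fields[8] if len(fields) > 8 else ""
        attributes, _, comments = trailing.partition("#")
        yield GFFRecord(
            seqname=fields[0],
            source=fields[1],
            feature=fields[2],
            start=int(fields[3]),
            end=int(fields[4]),
            score=None if fields[5] == "." else float(fields[5]),
            strand=None if fields[6] == "." else fields[6],
            frame=None if fields[7] == "." else int(fields[7]),
            attributes=attributes.strip(),
            comments=comments.strip(),
        )
```

The point of keeping the record this flat is that someone without any sequence data can still iterate over the file and filter on, say, feature or strand without touching SeqRecord at all.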

Bio.SeqIO then uses the parser in Bio.GFF and puts its information in the appropriate fields of a SeqRecord. Here, we have to think about two cases: simply creating a SeqRecord based on the GFF file, and adding the information in the GFF file as annotations to a pre-existing set of SeqRecords. (I am not sure whether we need a separate function for that or, as Peter suggested, should let users do that themselves, guided by some examples in the documentation.)
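
The two cases could be sketched roughly as follows. To keep the example free of any Biopython dependency, plain dicts stand in for both the parsed GFF rows and the SeqRecords; in a real implementation the "features" list would correspond to SeqRecord.features. The function names (group_by_seqname, records_from_gff, annotate_records) are invented for this sketch and are not a proposal for the final API.

```python
from collections import defaultdict


def group_by_seqname(rows):
    """Group parsed GFF rows (dicts with a 'seqname' key) by sequence name."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["seqname"]].append(row)
    return dict(grouped)


def records_from_gff(rows):
    """Case 1: build new, sequence-less records from the GFF data alone."""
    return {
        name: {"id": name, "seq": None, "features": features}
        for name, features in group_by_seqname(rows).items()
    }


def annotate_records(records, rows):
    """Case 2: attach GFF rows to pre-existing records, matched by id.

    Returns the seqnames that had no matching record, so the caller can
    decide whether that is an error.
    """
    unmatched = []
    for name, features in group_by_seqname(rows).items():
        if name in records:
            records[name]["features"].extend(features)
        else:
            unmatched.append(name)
    return unmatched


rows = [
    {"seqname": "chr1", "feature": "gene", "start": 1, "end": 900},
    {"seqname": "chr1", "feature": "exon", "start": 100, "end": 200},
    {"seqname": "chr2", "feature": "gene", "start": 5, "end": 50},
]

# Case 1: no FASTA file available.
new_records = records_from_gff(rows)

# Case 2: records already loaded from elsewhere (here only chr1).
existing = {"chr1": {"id": "chr1", "seq": "ACGT", "features": []}}
missing = annotate_records(existing, rows)
```

Case 2 is short enough that it may be better served by a documented example than by a dedicated function, which is the open question above.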

Users can then choose between Bio.SeqIO, to get SeqRecords, and Bio.GFF, to see the "raw" GFF data, depending on their needs.

How does that sound?

--Michiel