[BioPython] Initial work on a GFF parser

Mon Mar 9 10:14:55 UTC 2009

On Sun, Mar 8, 2009 at 4:29 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi all;
> Generic Feature Format (GFF) is a nice tab delimited file format
> that we don't have full support for in Biopython. Michael Hoffman
> contributed code to work with GFF MySQL databases (in Bio.GFF), but
> we don't have a GFF parser for the flatfiles. Looking back over the
> list archives, this has come up a couple of times without a finished
> solution being implemented. GFF suffers from the curse of being too easy
> to hack together a solution for parsing a very specific problem, while
> generating a good standard parser takes more work.

You're right about creating a good general parser taking more work ;)

See also enhancement Bug 2762, GFF capability in SeqIO, which has some
discussion.

Also, it wasn't clear from your blog if you are thinking about just
GFF version 3, or something more general that also copes with the
assorted, comparatively ill-defined GFF2 variants.
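
To make the version question concrete, here is a rough sketch of parsing
a single GFF3 line. The function names are made up for illustration and
are not taken from Brad's code or from Biopython. It assumes the GFF3
column 9 syntax (key=value pairs separated by semicolons, multiple values
separated by commas, %XX escapes), so it would not cope with the GFF2
variants, which use space separated tag/value pairs instead.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_attributes(text):
    """Parse a GFF3 column 9 string into a dict mapping keys to value lists."""
    attributes = {}
    if text in ("", "."):
        return attributes
    for pair in text.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # Split on commas first, then decode the %XX escapes in each value.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError("expected 9 tab separated columns, got %d" % len(parts))
    record = dict(zip(GFF3_COLUMNS, parts))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = parse_gff3_attributes(record["attributes"])
    return record


# Hypothetical example line: an exon shared by two mRNAs.
example = "chr1\tdemo\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=mRNA1,mRNA2;Note=a%3Bb\n"
record = parse_gff3_line(example)
print(record["attributes"]["Parent"])  # ['mRNA1', 'mRNA2']
```

A GFF2 line would need a different attribute parser, which is part of
why the scope matters.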

> Recently, Peter brought up GFF on the BioSQL mailing list, which
> made me interested in digging into GFF as an input and output flat
> file format for BioSQL databases. Towards this end I put together an
> initial implementation of a GFF (version 3) parser for Biopython. A
> write up and the code are here:
>
> http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/
>
> As described in the post, the GFF interface will be a bit different
> from the standard SeqIO interface, since GFF stores features
> separately from the sequences and also doesn't require features for
> a record to be grouped together.

Regarding where to put this code, if it isn't going to support the
Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but
maybe Bio.GFF or Bio.GFF3 instead.

However, you could still fit GFF(3) files into Bio.SeqIO; it's just
that the sequence may not be present.  This would be similar to GenBank
files, which usually have a long list of features plus the full
sequence, but where the sequence itself may be missing - for example if
there is just a CONTIG line.  Or QUAL files from sequencing, where there
is never a sequence.

As with GenBank files for a large genome/chromosome, for a typical GFF
file Bio.SeqIO would just return a single SeqRecord containing all
the features - within the SeqIO API there is no way to offer memory
efficient iteration over the features themselves.

Maybe we need to invent Bio.FeatureIO for this?  You could consider
GenBank/EMBL feature tables, GFF files, NCBI protein tables, and
probably a few other formats too.
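
As a very rough idea of what feature-level iteration could look like
(iter_features is a hypothetical name, not an existing Biopython
function, and the tuple it yields is just a placeholder for a real
feature object), a generator can hand back one feature per line without
holding the whole file in memory:

```python
import io


def iter_features(handle):
    """Yield (seqid, type, start, end, strand) for each feature line."""
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # the rest of a GFF3 file is sequence data, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        seqid, source, ftype, start, end, score, strand, phase, attrs = line.split("\t")
        yield seqid, ftype, int(start), int(end), strand


# Hypothetical two-feature file held in memory for the example.
demo = io.StringIO(
    "##gff-version 3\n"
    "chr1\tdemo\tgene\t1\t900\t.\t+\t.\tID=gene1\n"
    "chr1\tdemo\tmRNA\t1\t900\t.\t+\t.\tID=mRNA1;Parent=gene1\n"
    "##FASTA\n"
    ">chr1\n"
    "ACGT\n"
)
features = list(iter_features(demo))
```

Since GFF does not require the features of one record to be grouped
together, the caller would still have to do any per-record grouping.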

> As a result, the interface is up for discussion and the best path is to
> start with an implementation and see where it takes us. I'd be grateful
> for any feedback and code from those who are interested. We can discuss
> on the development mailing list or on the blog, and move towards getting
> stable full featured GFF parsing in Biopython.

From the blog post it sounds like you are using sub-features to store
the parent/child relationship between say mRNAs and genes.  This is
elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to
cope with the general parent (part-of) relationships allowed in GFF
files - for example an exon may have multiple parents.
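
To illustrate the multiple parent case, here is a small sketch with
made-up IDs and a made-up helper, build_parent_index, which is not part
of any existing parser. It indexes children by parent ID, so the
relationships form a graph rather than a strict tree of nested
sub-features:

```python
from collections import defaultdict


def build_parent_index(features):
    """Map each parent ID to the IDs of its children.

    features is an iterable of (feature_id, parent_ids) pairs.
    """
    children = defaultdict(list)
    for feature_id, parent_ids in features:
        for parent_id in parent_ids:
            children[parent_id].append(feature_id)
    return dict(children)


# Hypothetical gene with two transcripts sharing one exon.
features = [
    ("gene1", []),
    ("mRNA1", ["gene1"]),
    ("mRNA2", ["gene1"]),
    ("exon1", ["mRNA1", "mRNA2"]),  # one exon, two parents
]
index = build_parent_index(features)
```

Here exon1 shows up under both mRNA1 and mRNA2. With nested
sub-features you would have to either duplicate the exon or pick one
parent.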

There is also the complication that when parsing GenBank files, a gene
or CDS feature with a join-location ends up represented using
sub-features (which probably would be represented with an explicit
intron/exon structure in GFF files) [This is something I don't really
like with the current object structure].  We'd want things to be
fairly uniform between the parsers - for one thing our BioSQL code
currently records a feature with subfeatures as a single feature in
the database.

Peter