[BioPython] Initial work on a GFF parser

Brad Chapman chapmanb at 50mail.com
Mon Mar 9 22:42:24 UTC 2009


Peter;
Thanks much for the feedback.

> See also enhancement Bug 2762, GFF capability in SeqIO, which has some
> discussion.
> 
> Also, it wasn't clear from your blog if you are thinking about just
> GFF version 3, or something more general, coping with the assorted
> comparatively ill defined GFF2 variants.

Bug 2762 had a lot of good background and ideas which helped in
getting started. I did take the sub_feature route instead of the
flattened method Leighton suggested there.

Right now this tackles GFF3. The hard part is going to be
getting a framework in place, and then GFF2 or GFT (or GFF2.5 or
whatever they call it) support could be added.

> Regarding where to put this code, if it isn't going to support the
> Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but
> maybe Bio.GFF or Bio.GFF3 instead.
> 
> However, you could still fit gff(3) files into Bio.SeqIO, its just
> that the sequence may not be present.  This would be similar GenBank
> files usually have a long list of features plus the full sequence, but
> the sequence itself may be missing - for example if there is a just a
> CONTIG line.  Or QUAL files from sequencing where there is never a
> sequence.

Yes, where it lives is a good topic for debate. For GFF files, you'd
at least like the option to add new features to an existing sequence
record, which is what I do here. It would be easy enough to create
new blank records if one is not present initially. The difficult
thing with adding this to the existing syntax is that the GFF files
are not ordered for efficient iteration. You essentially have to
parse the whole file, so something like this would handle the
syntax:

seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
final_seq_dict = SeqIO.add_features(gff_handle, "gff3", initial_dict=seq_dict)

Along these lines, I liked the way you did a sequence/quality dual
iterator for quality output and think that works well when ordering of
the records in multiple files is stable.

> As with GenBank files for large genome/chromosome, for a typical GFF
> file for Bio.SeqIO we'd just return a single SeqRecord containing all
> the features - within the SeqIO API there is no way to offer memory
> efficient iteration over the features themselves.
> 
> Maybe we need to invent Bio.FeatureIO for this?  You could consider
> GenBank/EMBL feature tables, GFF files, NCBI protein tables, and
> probably a few other formats too.

FeatureIO is something BioPerl has; this page describes the status
of GFF in BioPerl but is over a year old so things may have changed:

http://www.bioperl.org/wiki/GFF_code_audit

The iteration model still falls apart because of the undefined
ordering of the file. That is why I settled on the filter approach
to limit what you get to a reasonable memory size but still guarantee
you've pulled all relevant features before building the parent/child
relationships and features. This could also apply to data that comes off
cluster runs where the output order will not necessarily correlate with
the inputs.

The filtering approach could also be useful for large GenBank files,
as you could skip adding features and parsing locations for elements you
are not interested in. If others find this approach intuitive, it
would be worth looking at there as well.

> From the blog post it sounds like you are using sub-features to store
> the parent/child relationship between say mRNAs and genes.  This is
> elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to
> cope with the general parent (part-of) relationships allowed in GFF
> files - for example an exon may have multiple parents.

For these the exon is added as a sub_feature to all of its parents. The
shared feature is the same one in memory. t_nested_multiparent_features
in the test code demonstrates this. How we output it to BioSQL is up for
debate but we should also be able to do some sharing there; duplication
is also not too bad of an option if it makes it cleaner since these are
not likely to be deeply nested.

> There is also the complication that when parsing GenBank files, a gene
> or CDS feature with a join-location ends up represented using
> sub-features (which probably would be represented with an explicit
> intron/exon structure in GFF files) [This is something I don't really
> like with the current object structure].  We'd want things to be
> fairly uniform between the parsers - for one thing our BioSQL code
> currently records a feature with subfeatures as a single feature in
> the database.

BioSQL definitely needs work to handle sub_features more generally.
The seqfeature_relationship table in BioSQL can handle these but it
needs to be coded. I agree with you that the way we do it now is a
little too GenBank specific. This is a bit of a larger project since
we should coordinate with the other projects, but as long as we
continue to support the same location mechanism they use currently
it will be back-compatible with older code.

Thanks again for the thoughts,
Brad



More information about the Biopython mailing list