[Biopython-dev] Bio.GFF and Brad's code

Sat Dec 5 15:54:19 UTC 2009

I didn't realize that the GFF parser returns SeqRecords. I agree with Peter that a parser returning SeqRecords should be accessed through Bio.SeqIO, while a lower-level parser can live in Bio.GFF.

--Michiel

--- On Thu, 12/3/09, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython-dev] Bio.GFF and Brad's code
> To: "Brad Chapman" <chapmanb at 50mail.com>, biopython-dev at lists.open-bio.org
> Date: Thursday, December 3, 2009, 9:53 AM
> On Thu, Dec 3, 2009 at 2:25 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> >
> > Great -- done for parsing and writing and committed to GitHub. The
> > documentation is updated as well.
> >
> > Happy to get other comments and thoughts. Thanks again,
> >
> 
> I understand that GFF files are complex, and a simple "record
> iterator" isn't flexible enough to cover all use cases - hence the
> need for a complex parser class. That said, Michiel is right that
> GFF.parse() or GFF.read() functions would be consistent with
> other bits of Biopython, and would provide for the simple use
> cases.
> 
> Looking at your code, BCBio.GFF.parse(...) would return
> SeqRecord objects (with SeqFeatures). That seems
> redundant to me, as one would expect people to just use
> Bio.SeqIO.parse(handle, "gff3") instead. I would instead
> have expected BCBio.GFF.parse(...) to iterate over the
> features in a GFF file.
> 
> Also, and we'd touched on this before - I'd much prefer to
> have the GFF module quite "low level" using either new
> GFF-specific classes or simple Python objects (e.g. for
> each feature use a tuple of ints and strings for the first
> feature columns plus a dict for the final extendible
> column of annotation).
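
A minimal sketch of the kind of low-level representation described above, assuming GFF3 input with nine tab-separated columns. The names parse_gff3_line and iter_features are hypothetical and are not part of Bio.GFF or BCBio.GFF; each feature comes back as a plain tuple of the eight fixed columns plus a dict for the attributes column.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into (fixed_columns, attributes).

    Hypothetical helper for illustration only.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    # "." marks an undefined value in the score, strand and phase columns.
    fixed = (
        unquote(seqid),
        source,
        ftype,
        int(start),  # 1-based, inclusive, as written in the file
        int(end),
        None if score == "." else float(score),
        None if strand == "." else strand,
        None if phase == "." else int(phase),
    )
    attributes = {}
    if attr_text != ".":
        for pair in attr_text.split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            # A tag may carry several comma-separated values.
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return fixed, attributes


def iter_features(handle):
    """Yield (fixed_columns, attributes) for each feature line in a handle."""
    for line in handle:
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence data
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        yield parse_gff3_line(line)
```

Nothing here touches SeqRecord or SeqFeature, so code that only needs the features pays no conversion cost.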
> 
> From a technical point of view, a justification for this
> separation is that the GFF details are not a perfect fit to the
> SeqRecord and SeqFeature objects, and forcing their
> use adds unnecessary overhead for people wanting
> to work directly with the features themselves.
> 
> Also, by splitting the code into basic parsing and a
> SeqRecord/SeqFeature conversion layer (which I
> would put in Bio/SeqIO/GffIO.py) we can add the
> code in two steps (first GFF parsing, then SeqIO
> support).
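
One way to picture the second layer, as a rough sketch rather than a proposed implementation: it takes the low-level (tuple, dict) features and reshapes them into what a SeqRecord/SeqFeature builder would need. The function name to_record_layout and the plain dicts standing in for SeqFeature objects are assumptions for illustration. The sketch also shows two places where GFF does not map directly: GFF coordinates are 1-based and inclusive, while Biopython locations are 0-based and half-open, and strand is a character rather than +1/-1.

```python
from collections import defaultdict

# GFF strand characters mapped to the +1/-1 convention; anything else -> None.
STRAND_TO_INT = {"+": 1, "-": -1}


def to_record_layout(features):
    """Group low-level features by sequence ID in a SeqFeature-like layout.

    Hypothetical conversion step; plain dicts stand in for SeqFeature objects.
    """
    records = defaultdict(list)
    for fixed, attributes in features:
        seqid, source, ftype, start, end, score, strand, phase = fixed
        records[seqid].append({
            "type": ftype,
            "start": start - 1,  # 1-based inclusive -> 0-based half-open
            "end": end,
            "strand": STRAND_TO_INT.get(strand),
            # Keep the source column alongside the attributes so it is not lost.
            "qualifiers": dict(attributes, source=[source]),
        })
    return dict(records)
```

Because this step only consumes the tuples and dicts, the basic parser can be added and tested first, with the SeqIO-facing conversion following later.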
> 
> I think this split is useful as this is a very big job to do
> properly: Once we have GFF to SeqRecord parsing,
> we need to try and ensure that it is compatible with the
> GenBank to SeqRecord parsing. This is important as
> we would in effect be extending Biopython to allow
> GFF3 to GenBank conversions. For testing all this,
> we can grab the same data in the two file formats
> (e.g. from the NCBI) and perhaps also use EMBOSS.
> 
> You may recall we talked to Peter Rice (from EMBOSS)
> about this - there are some important issues here like
> ontology mapping where we should be able to reuse a
> lot of the work EMBOSS has already done (and use the
> EMBOSS tools to help validate our mapping).
> 
> In other words, while I may be overly cautious, I think that
> adding GFF parsing and GFF to SeqRecord mapping is
> very important but also very complex. Breaking this
> into a two-stage task therefore makes managing and
> testing it easier, and it seems a good idea for the
> code itself.
> 
> Peter