[Biopython-dev] Bio.GFF and Brad's code
Peter
biopython at maubp.freeserve.co.uk
Thu Dec 3 14:53:44 UTC 2009
On Thu, Dec 3, 2009 at 2:25 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Great -- done for parsing and writing and committed to GitHub. The
> documentation is updated as well.
>
> Happy to get other comments and thoughts. Thanks again,
>
I understand that GFF files are complex, and a simple "record
iterator" isn't flexible enough to cover all use cases - hence the
need for a complex parser class. That said, Michiel is right that
GFF.parse() or GFF.read() functions would be consistent with
other bits of Biopython, and would provide for the simple use
cases.
Looking at your code, BCBio.GFF.parse(...) would return
SeqRecord objects (with SeqFeatures). That seems
redundant to me, as one would expect people to just use
Bio.SeqIO.parse(handle, "gff3") instead. I would instead
have expected BCBio.GFF.parse(...) to iterate over the
features in a GFF file.
Also, and we'd touched on this before - I'd much prefer to
have the GFF module quite "low level" using either new
GFF-specific classes or simple Python objects (e.g. for
each feature use a tuple of ints and strings for the first
eight fixed columns plus a dict for the final extensible
column of annotation).
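As a rough sketch of what I mean by the "simple Python objects" option (not a proposal for the final API): the function names iter_gff_features and parse_attributes below are made up for illustration, and I am assuming well-formed, tab-separated GFF3 with nine columns. Only start and end are converted to ints; score and phase are left as strings to keep things minimal.

```python
import io
from urllib.parse import unquote


def parse_attributes(column):
    """Turn the ninth GFF3 column into a dict mapping tag -> list of values."""
    attributes = {}
    if column in ("", "."):
        return attributes
    for pair in column.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Split on commas first, then decode percent-escapes, so that an
        # escaped comma (%2C) inside a value is not treated as a separator.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def iter_gff_features(handle):
    """Yield one (fixed_columns, attributes) pair per GFF3 feature line.

    Hypothetical helper: it does no validation beyond expecting nine
    tab-separated columns, and it builds no SeqRecord or SeqFeature.
    """
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence data follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        (seqid, source, ftype, start, end,
         score, strand, phase, attrs) = line.split("\t")
        # Coordinates stay 1-based and inclusive, exactly as in the file.
        fixed = (seqid, source, ftype, int(start), int(end),
                 score, strand, phase)
        yield fixed, parse_attributes(attrs)


# Small usage example on an in-memory file.
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\ttest\tgene\t100\t200\t.\t+\t.\tID=gene1;Name=abc,def\n"
)
for fixed, attributes in iter_gff_features(example):
    print(fixed, attributes)
```

Anything SeqRecord or SeqFeature related would then sit on top of an iterator like this rather than inside it.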
From a technical point of view, a justification for this
separation is that the GFF details are not a perfect fit for
the SeqRecord and SeqFeature objects, and forcing their
use adds unnecessary overhead for people wanting
to work directly with the features themselves.
Also, by splitting the code into basic parsing and a
SeqRecord/SeqFeature conversion layer (which I
would put in Bio/SeqIO/GffIO.py) we can add the
code in two steps (first GFF parsing, then SeqIO
support).
I think this split is useful as this is a very big job to do
properly: once we have GFF to SeqRecord parsing,
we need to try and ensure that it is compatible with the
GenBank to SeqRecord parsing. This is important as
we would in effect be extending Biopython to allow
GFF3 to GenBank conversions. For testing all this,
we can grab the same data in the two file formats
(e.g. from the NCBI) and perhaps also use EMBOSS.
You may recall we talked to Peter Rice (from EMBOSS)
about this - there are some important issues here like
ontology mapping where we should be able to reuse a
lot of the work EMBOSS has already done (and use the
EMBOSS tools to help validate our mapping).
In short, while I may be being overly cautious, I think
that adding GFF parsing and GFF to SeqRecord
mapping, though very important, is also very complex.
Breaking this into a two-stage task therefore makes
managing and testing it easier - as well as seeming
a good idea for the code itself.
Peter