[BioPython] Initial work on a GFF parser
Brad Chapman
chapmanb at 50mail.com
Sun Mar 8 12:29:41 EDT 2009
Hi all;
Generic Feature Format (GFF) is a nice tab-delimited file format
that we don't have full support for in Biopython. Michael Hoffman
contributed code to work with GFF MySQL databases (in Bio.GFF), but
we don't have a GFF parser for the flat files. Looking back over the
list archives, this has come up a couple of times without a finished
solution being implemented. GFF suffers from a curse: it is easy to
hack together a parser for one very specific problem, while writing
a good general-purpose parser takes more work.
Recently, Peter brought up GFF on the BioSQL mailing list, which
made me interested in digging into GFF as an input and output flat
file format for BioSQL databases. To that end, I put together an
initial implementation of a GFF (version 3) parser for Biopython. A
write-up and the code are here:
http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/
As described in the post, the GFF interface will be a bit different
from the standard SeqIO interface, since GFF stores features
separately from the sequences and also doesn't require features for
a record to be grouped together.
As a result, the interface is up for discussion and the best path is to
start with an implementation and see where it takes us. I'd be grateful
for any feedback and code from those who are interested. We can discuss
on the development mailing list or on the blog, and move towards getting
stable, full-featured GFF parsing in Biopython.
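
For anyone who wants a concrete picture of the grouping issue mentioned
above, here is a minimal sketch. It is not the parser from the blog post,
and the function names (parse_gff3_line, group_by_seqid) are made up for
illustration. It assumes plain GFF3 lines with nine tab-separated columns,
skips comments and directives, and stops at an embedded ##FASTA section.
Because features for one record can be scattered through the file, they
are collected into a dictionary keyed by sequence ID:

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper).

    Returns None for blank lines, comments, and '##' directives.
    """
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    parts = line.split("\t")
    if len(parts) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(parts))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = parts
    # Column 9 holds 'key=value' pairs separated by ';'. A value may be a
    # comma-separated list, and special characters are URL-escaped.
    attributes = {}
    if attr_text != ".":
        for pair in attr_text.split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": unquote(seqid),
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF coordinates are 1-based, inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


def group_by_seqid(lines):
    """Collect features per sequence ID, whatever order they appear in."""
    grouped = defaultdict(list)
    for line in lines:
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence data
        feature = parse_gff3_line(line)
        if feature is not None:
            grouped[feature["seqid"]].append(feature)
    return dict(grouped)


# Made-up example data: the two chr1 features are not adjacent in the file.
example_lines = [
    "##gff-version 3\n",
    "chr1\texample\tgene\t100\t500\t.\t+\t.\tID=gene1;Name=abc\n",
    "chr2\texample\tgene\t10\t90\t.\t-\t.\tID=gene2\n",
    "chr1\texample\tmRNA\t100\t500\t.\t+\t.\tID=mrna1;Parent=gene1\n",
]
features = group_by_seqid(example_lines)
```

Holding everything in a dictionary like this is the simple approach and
would use a lot of memory on large files, which is part of why the
interface deserves discussion.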
Brad