[BioSQL-l] Importing GFF3 files into a BioSQL database?
biopython at maubp.freeserve.co.uk
Wed Feb 18 09:27:55 EST 2009
Do any of the Bio* projects currently let you import a GFF3 (or even a
GFF2) file into a BioSQL database?
Looking at some of the examples on
http://www.sequenceontology.org/gff3.shtml this looks possible. I
assume each GFF file normally describes features on a single
plasmid/chromosome, meaning a single bioentry table entry. I would
expect each GFF feature to become a seqfeature table entry (with a
location table entry for each line describing its location), and the
main sequence (if present in the GFF file) would be a biosequence
table entry. So far this isn't too complicated. The GFF3
documentation gives some examples of "parent" or rather "part-of"
relationships between features (e.g. an exon which is part of three
parent mRNA features). Perhaps three entries in the
seqfeature_relationship table could record this.
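
To make the proposed mapping concrete, here is a rough sketch in plain Python (standard library only). It is not an existing Bio* loader: the function names, the dictionary keys, and the small example data are all made up for illustration, and the dictionaries are simplified stand-ins for BioSQL rows rather than the real table columns. It assumes nine tab-separated columns per feature line, and that lines sharing an ID describe one feature with several locations.

```python
from urllib.parse import unquote

# Hypothetical example data: one exon that is part of three mRNA features.
GFF3_EXAMPLE = (
    "##gff-version 3\n"
    "chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=mRNA1\n"
    "chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=mRNA2\n"
    "chr1\texample\tmRNA\t300\t900\t.\t+\t.\tID=mRNA3\n"
    "chr1\texample\texon\t500\t700\t.\t+\t.\tID=exon1;Parent=mRNA1,mRNA2,mRNA3\n"
)


def parse_attributes(column9):
    """Split the ninth GFF3 column into a dict of tag -> list of values."""
    attributes = {}
    for pair in column9.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes


def gff3_to_rows(text):
    """Turn GFF3 text into lists of simplified, row-like dictionaries."""
    seqfeatures = {}    # one entry per feature, keyed by its ID
    locations = []      # one entry per GFF line
    relationships = []  # one entry per (parent, child) pair
    for number, line in enumerate(text.splitlines(), start=1):
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        seqid, source, ftype, start, end, _score, strand, _phase, col9 = line.split("\t")
        attrs = parse_attributes(col9)
        # Features without an ID get a placeholder key (an assumption).
        key = attrs.get("ID", ["_line%d" % number])[0]
        if key not in seqfeatures:
            seqfeatures[key] = {"bioentry": seqid, "id": key,
                                "type": ftype, "source": source}
            # Record the part-of links once per feature, not once per line.
            for parent in attrs.get("Parent", []):
                relationships.append({"parent": parent, "child": key,
                                      "term": "part_of"})
        locations.append({"feature": key, "start": int(start), "end": int(end),
                          "strand": {"+": 1, "-": -1}.get(strand, 0)})
    return list(seqfeatures.values()), locations, relationships


seqfeatures, locations, relationships = gff3_to_rows(GFF3_EXAMPLE)
for row in relationships:
    print(row["child"], "part_of", row["parent"])
```

With the example data, the exon line yields one feature entry, one location entry, and three relationship entries (one per parent mRNA), which is the shape suggested above for the seqfeature_relationship table.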
Also, GFF3 files seem to be very organised with regard to ontologies -
something we have touched on before on this mailing list.
My reason for asking concerns adding GFF parsing to Biopython.
Biopython has a parsing framework (Bio.SeqIO) which turns various file
formats (e.g. GenBank) into objects (SeqRecord objects, with optional
SeqFeature objects), which we can map onto the BioSQL tables. If we
manage to integrate GFF parsing into Biopython's Bio.SeqIO framework
(non-trivial), then Biopython would as a consequence be able to load a
GFF file into BioSQL. If any of the other Bio* projects can already
import GFF files into BioSQL, I'd like Biopython to load the data into
the database in the same way. Essentially this would give a recipe
for how we should model the GFF data in our objects in order to
achieve this inter-project BioSQL compatibility.