[BioSQL-l] Importing GFF3 files into a BioSQL database?

Peter biopython at maubp.freeserve.co.uk
Mon Feb 23 10:23:07 UTC 2009


>>> Do any of the Bio* projects currently let you import a GFF3 (or even a
>>> GFF2) file into a BioSQL database?
>>
>> To get more familiar with BioPerl GFF mappings, I took a look at how
>> GenBank files get converted to GFF files with BioPerl. Generally,
>> most things map as you'd expect but a few items are left behind. I
>> wrote up the details on the current mappings, along with some
>> proposals for expanding them, here:
>>
>>
>> http://bcbio.wordpress.com/2009/02/22/exploring-bioperl-genbank-to-gff-mapping/

That looks interesting, Brad.  You seem to be focusing on the top-level
record annotation.  From my point of view, the feature annotation (or
feature qualifiers, as Biopython calls them) is more interesting.

Brad wrote:
>> Normalizing the GFF and standard SeqIO representations is a great
>> idea. I use BioSQL quite a bit, and it would be nice to be able to
>> output GFF formatted files directly from bioentries.

I was thinking of just loading GFF files with Biopython's SeqIO to
start with (and thus getting GFF into BioSQL).  In order to dump a
BioSQL entry into a GFF file we'd need Biopython's SeqIO to be able to
write GFF files.  I'm not sure we can do that if the parsing is lossy.
It is the relationships between features that strike me as the most
work (storing these as simple strings may be good enough).
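
To make the "simple strings" idea concrete, here is a rough sketch in
plain Python.  It is not how Biopython's SeqIO works or would work; the
function name parse_gff3_line and the example line are made up for
illustration.  It only assumes the GFF3 layout of nine tab-separated
columns with URL-encoded key=value attributes, and it keeps the Parent
relationship as an ordinary string qualifier rather than resolving it.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict (hypothetical helper).

    Relationships such as Parent are NOT resolved; they are kept as
    plain string values alongside the other attributes.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    qualifiers = {}
    for pair in attrs.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # Multiple values are comma separated; each value is URL-encoded.
        qualifiers[unquote(key)] = [unquote(v) for v in value.split(",")]

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "qualifiers": qualifiers,
    }


# Made-up example line: an mRNA whose parent is the feature with ID gene1.
example = (
    "chr1\texample\tmRNA\t100\t900\t.\t+\t.\t"
    "ID=mrna1;Parent=gene1;Note=hypothetical%20protein"
)
feature = parse_gff3_line(example)
print(feature["type"], feature["qualifiers"]["Parent"])
```

Because the ID and Parent values survive as strings, writing the
feature back out would not need the hierarchy to be rebuilt first.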

On Sun, Feb 22, 2009 at 11:18 PM, Chris Fields <cjfields at illinois.edu> wrote:
>
> Re: bioperl; we're planning on refactoring several bits in BioPerl for
> consistency.
>
> http://www.bioperl.org/wiki/GFF_Refactor
>
> The problem is there are several different methods to parse and generate GFF
> strings, some only partially implemented.  I would like to coalesce around a
> central mode of generating such output, or at least have a way to validate
> such data.
>
> Another issue is that the typical bioperl seqfeature comes flattened
> (non-hierarchal) and untyped (no check against SO) by default when parsed
> from a GenBank file.  The bp_genbank2gff3.pl attempts to rectify this, but I
> think an integrated optional way of generating unflattened (hierarchal)
> typed feature data within the SeqIO parsers would be better.  There is a
> simple way we could implement this, just need time to work it in.

Something we've touched on before on this mailing list is loading
GenBank files into BioSQL while checking them against an ontology,
rather than the current ad-hoc approach where any new term (even a
spelling error) gets recorded as a new ontology term.  That seems to
be a related point.
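
As a minimal sketch of that kind of check (plain Python, not the BioSQL
loader's actual behaviour): the set KNOWN_FEATURE_KEYS and the function
split_by_vocabulary are hypothetical, and a real vocabulary would come
from an ontology rather than a hard-coded set.

```python
# Hypothetical controlled vocabulary; a real one would be loaded from
# an ontology instead of being hard-coded.
KNOWN_FEATURE_KEYS = {"gene", "CDS", "mRNA", "tRNA", "rRNA", "misc_feature"}


def split_by_vocabulary(terms, known=KNOWN_FEATURE_KEYS):
    """Separate terms found in the vocabulary from unknown ones.

    Unknown terms are reported back to the caller instead of being
    silently recorded as new ontology terms.
    """
    accepted = [t for t in terms if t in known]
    rejected = [t for t in terms if t not in known]
    return accepted, rejected


# "msic_feature" is a deliberate misspelling of "misc_feature".
accepted, rejected = split_by_vocabulary(["gene", "CDS", "mRNA", "msic_feature"])
print("accepted:", accepted)
print("rejected:", rejected)
```

Whether a rejected term should abort the load or just raise a warning
is a separate policy question.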

If I understand correctly from Chris's and Brad's posts, with BioPerl
you could go from a GFF file to a GenBank file to BioSQL, but not
directly from GFF to BioSQL?

Peter



