[BioSQL-l] Importing GFF3 files into a BioSQL database?

Chris Fields cjfields at illinois.edu
Sun Feb 22 23:18:54 UTC 2009


Re: BioPerl: we're planning to refactor several parts of BioPerl for
consistency.

http://www.bioperl.org/wiki/GFF_Refactor

The problem is that there are several different methods for parsing and
generating GFF strings, some only partially implemented.  I would like
to coalesce around a single central way of generating such output, or at
least have a way to validate such data.
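
As a rough illustration of what a validation step might check, here is a
minimal Python sketch. It is not BioPerl code, and the function name
validate_gff3_line is made up for this example. It only looks at the
column count, coordinates, score, strand and phase of a single GFF3
feature line, and it does not check the type column against SO:

```python
def validate_gff3_line(line):
    """Return a list of problems found in one GFF3 feature line.

    Hypothetical helper for illustration; an empty list means no
    problems were found by these (deliberately shallow) checks.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return ["expected 9 tab-separated columns, found %d" % len(cols)]
    seqid, source, ftype, start, end, score, strand, phase, attributes = cols
    problems = []
    # Coordinates are 1-based integers with start <= end.
    try:
        start_pos, end_pos = int(start), int(end)
    except ValueError:
        problems.append("start and end must be integers")
    else:
        if start_pos < 1 or start_pos > end_pos:
            problems.append("need 1 <= start <= end")
    # Score is a number, or "." when absent.
    if score != ".":
        try:
            float(score)
        except ValueError:
            problems.append("score must be a number or '.'")
    if strand not in ("+", "-", ".", "?"):
        problems.append("strand must be one of + - . ?")
    # Phase is 0, 1 or 2 for CDS features and "." otherwise.
    if phase not in (".", "0", "1", "2"):
        problems.append("phase must be 0, 1, 2 or '.'")
    elif ftype == "CDS" and phase == ".":
        problems.append("CDS features need a phase of 0, 1 or 2")
    return problems
```

A real validator would also need to handle directives, attribute
escaping and ontology lookups, which this sketch leaves out.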

Another issue is that the typical BioPerl seqfeature comes flattened
(non-hierarchical) and untyped (no check against SO) by default when
parsed from a GenBank file.  The bp_genbank2gff3.pl script attempts to
rectify this, but I think an integrated, optional way of generating
unflattened (hierarchical), typed feature data within the SeqIO parsers
would be better.  There is a simple way we could implement this; we just
need time to work it in.
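
To make the flattened versus unflattened distinction concrete, here is a
minimal Python sketch. It is an illustration only: it is not the BioPerl
implementation and not necessarily the "simple way" mentioned above. It
assumes each feature is a plain dict with "type", "start", "end" and
"locus_tag" keys (a hypothetical layout), that types are already valid SO
term names, and that a gene > mRNA > CDS hierarchy can be recovered from
a shared locus_tag:

```python
# Assumed parent type for each child type in a simple gene model.
PARENT_TYPE = {"mRNA": "gene", "CDS": "mRNA"}


def unflatten(features):
    """Nest a flat feature list into gene -> mRNA -> CDS trees.

    Each feature is attached to the most recently seen feature of its
    parent type that shares the same locus_tag. Features with no such
    parent stay at the top level.
    """
    roots = []
    latest = {}  # (type, locus_tag) -> most recently seen node
    for feat in features:
        node = dict(feat, children=[])
        parent_type = PARENT_TYPE.get(feat["type"])
        parent = latest.get((parent_type, feat["locus_tag"]))
        if parent is None:
            roots.append(node)
        else:
            parent["children"].append(node)
        latest[(feat["type"], feat["locus_tag"])] = node
    return roots
```

Real GenBank records are messier than this (missing mRNA features,
several transcripts per gene, features without a locus_tag), so grouping
by a single qualifier is only a starting point.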

chris

On Feb 22, 2009, at 3:18 PM, Brad Chapman wrote:

> Hi Peter;
>
>> Do any of the Bio* projects currently let you import a GFF3 (or even a
>> GFF2) file into a BioSQL database?
>
> Normalizing the GFF and standard SeqIO representations is a great
> idea. I use BioSQL quite a bit, and it would be nice to be able to
> output GFF formatted files directly from bioentries.
>
> To get more familiar with BioPerl GFF mappings, I took a look at how
> GenBank files get converted to GFF files with BioPerl. Generally,
> most things map as you'd expect but a few items are left behind. I
> wrote up the details on the current mappings, along with some
> proposals for expanding them, here:
>
> http://bcbio.wordpress.com/2009/02/22/exploring-bioperl-genbank-to-gff-mapping/
>
> I think for the Biopython mapping we could try and follow what
> BioPerl does where it makes good sense, and introduce the other
> items in a way that is consistent and could be followed by other
> projects.
>
> Hope this helps move things forward,
> Brad
>
>>
>> Looking at some of the examples on
>> http://www.sequenceontology.org/gff3.shtml this looks possible.  I
>> assume each GFF file normally describes features on a single
>> plasmid/chromosome, meaning a single bioentry table entry.  I would
>> expect each GFF feature to become a seqfeature table entry (with a
>> location table entry for each line describing its location), and the
>> main sequence (if present in the GFF file), would be a biosequence
>> table entry.  So far this isn't too complicated.  The GFF3
>> documentation gives some examples of "parent" or rather "part-of"
>> relationships between features (e.g. an exon which is part of three
>> parent mRNA features).  Perhaps three entries in the
>> seqfeature_relationship table could record this.
>>
>> Also, GFF3 files seem to be very organised with regard to ontologies -
>> something we have touched on before on this mailing list.
>>
>> My reason for asking concerns adding GFF parsing to Biopython.
>> Biopython has a parsing framework (Bio.SeqIO) which turns various file
>> formats (e.g. GenBank) into objects (SeqRecord objects, with optional
>> SeqFeature objects), which we can map onto the BioSQL tables.  If we
>> manage to integrate GFF parsing into Biopython's Bio.SeqIO framework
>> (non-trivial), then Biopython would as a consequence be able to load a
>> GFF file into BioSQL.  If any of the other Bio* projects can already
>> import GFF files into BioSQL, I'd like Biopython to load the data into
>> the database in the same way.  Essentially this would give a recipe
>> for how we should model the GFF data in our objects in order to
>> achieve this inter-project BioSQL compatibility.
>>
>> Peter
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l
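
The mapping Peter describes in the quoted message (one seqfeature entry
per GFF feature, one location entry per line, and one
seqfeature_relationship entry per parent) can be sketched in Python
roughly as follows. This is a sketch under stated assumptions, not an
existing Bio* loader: the function names are invented, the dictionary
keys are only loosely modelled on BioSQL column names and should be
checked against the actual schema, the "part_of" term and the
object/subject direction are assumptions, and attribute values are not
URL-unescaped:

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of value lists."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value.split(",")
    return attrs


def gff3_to_rows(lines):
    """Turn GFF3 lines into seqfeature, location and relationship rows."""
    seqfeatures = {}       # feature ID -> seqfeature row
    locations = []
    relationships = []
    location_counts = {}   # feature ID -> number of locations so far
    strand_codes = {"+": 1, "-": -1}
    for number, line in enumerate(lines, start=1):
        if line.startswith("##FASTA"):
            break  # embedded sequence data follows; not handled here
        if not line.strip() or line.startswith("#"):
            continue
        # Raises ValueError on malformed lines; fine for a sketch.
        seqid, source, ftype, start, end, _score, strand, _phase, col9 = (
            line.rstrip("\n").split("\t"))
        attrs = parse_attributes(col9)
        # Lines sharing an ID describe one feature with several locations.
        feature_id = attrs.get("ID", ["line%d" % number])[0]
        if feature_id not in seqfeatures:
            seqfeatures[feature_id] = {
                "id": feature_id, "bioentry": seqid,
                "type": ftype, "source": source}
            # One relationship row per parent (child is the subject).
            for parent in attrs.get("Parent", []):
                relationships.append(
                    {"object": parent, "subject": feature_id,
                     "term": "part_of"})
        rank = location_counts.get(feature_id, 0) + 1
        location_counts[feature_id] = rank
        locations.append(
            {"seqfeature": feature_id, "start_pos": int(start),
             "end_pos": int(end), "strand": strand_codes.get(strand, 0),
             "rank": rank})
    return list(seqfeatures.values()), locations, relationships
```

With this layout, an exon line carrying Parent=mRNA1,mRNA2,mRNA3 yields
a single seqfeature row, one location row and three relationship rows,
which matches the "three entries" idea in the quoted text.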