[Bioperl-l] bioperl + GFF3 audit

Jason Stajich jason at bioperl.org
Wed Sep 19 00:04:05 UTC 2007


Something to throw out there for discussion with the GFF3 gurus.  Maybe
we can have a little state-of-GFF3-and-compliance session at the GMOD
workshop after Genome Informatics in November?

I propose that after we get the next stable release out we consider
doing a systematic code audit to ensure that we can really generate
proper GFF3-compliant data from all of our parsers.  This would include both
good ID/Parent as well as .  I'd be happy to also think about making
sure we can generate proper GTF/GFF2.5.  Whether this means we have a
translator that works on these objects, or we have to code this into
the parser software that is creating the sequence features, I'm not sure.
The whole Bio::Tools mishmash is a little unsettling when trying to
generate standardized output.  I'm not really clear whether
Bio::FeatureIO actually tries to do this properly, but
'gene_id'/'transcript_id' for GTF and ID/Parent 3-level Features for
gene->transcript->exon/CDS don't really come out properly, and I end up
writing workarounds on the downstream data.
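
To make the translator idea concrete, here is a minimal sketch of
deriving the GTF 'gene_id'/'transcript_id' attributes from ID/Parent
links.  It is plain Python rather than BioPerl, and everything in it
(the gtf_attribute_column function, the TRANSCRIPT_TYPES set, the
dictionary layout, the g1/t1/e1 identifiers) is a hypothetical
illustration, not an existing Bio::FeatureIO or Bio::Tools::GFF
interface.

```python
# Assumption: which types count as "transcript level"; not an exhaustive list.
TRANSCRIPT_TYPES = {"mRNA", "tRNA", "transcript"}

def gtf_attribute_column(feature_id, features):
    """Build the GTF attribute column for one exon/CDS-level feature.

    features maps an ID to {"type": <SO term>, "parent": <ID or None>}.
    The function walks the Parent links upward until it has seen both
    the enclosing transcript and the enclosing gene.
    """
    gene_id = None
    transcript_id = None
    current = features[feature_id]["parent"]
    while current is not None:
        ftype = features[current]["type"]
        if ftype in TRANSCRIPT_TYPES and transcript_id is None:
            transcript_id = current
        elif ftype == "gene":
            gene_id = current
        current = features[current]["parent"]
    if gene_id is None or transcript_id is None:
        raise ValueError(f"{feature_id} lacks a gene/transcript ancestor")
    return f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'

# Hypothetical three-level hierarchy: gene -> mRNA -> exon
example_features = {
    "g1": {"type": "gene", "parent": None},
    "t1": {"type": "mRNA", "parent": "g1"},
    "e1": {"type": "exon", "parent": "t1"},
}
```

A flat feature with no Parent links raises ValueError here, which is
the case a real translator would have to decide how to handle.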

One aspect that is biting us is flat versus multi-level features
(genes -> transcripts -> exons) and how we handle them.  I think this
ought to get fleshed out better so we can really support .  A lot of
the Bio::Tools parsers are pretty laissez-faire about this, and we
have a variety of non-standard and non-compliant aspects.

For example, I am playing with tRNA parsing and I assume that proper
GFF3 here is three levels of features:
gene -> tRNA -> exon
with those being the primary_tag names that correspond to the  
Sequence Ontology.
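
As a rough illustration of that three-level layout, the sketch below
writes gene, tRNA and exon records linked through ID/Parent.  It is
plain Python, not BioPerl; the gff3_line and trna_records helpers, the
"gene:"/"tRNA:"/"exon:" ID prefixes and the example coordinates are
all assumptions made for the example.

```python
from urllib.parse import quote

def gff3_line(seqid, source, ftype, start, end, strand, attrs):
    """Format one nine-column GFF3 record (score and phase left as '.')."""
    # Percent-encode attribute values so ';', '=', '&' and ',' cannot
    # be mistaken for column-9 delimiters.
    column9 = ";".join(
        f"{tag}={quote(str(val), safe=' :')}" for tag, val in attrs.items()
    )
    return "\t".join(
        [seqid, source, ftype, str(start), str(end), ".", strand, ".", column9]
    )

def trna_records(seqid, source, name, start, end, strand, exons):
    """Return gene -> tRNA -> exon records; exons is a list of (start, end)."""
    gene_id = f"gene:{name}"   # hypothetical ID scheme
    trna_id = f"tRNA:{name}"
    lines = [
        gff3_line(seqid, source, "gene", start, end, strand, {"ID": gene_id}),
        gff3_line(seqid, source, "tRNA", start, end, strand,
                  {"ID": trna_id, "Parent": gene_id}),
    ]
    for n, (ex_start, ex_end) in enumerate(exons, 1):
        lines.append(
            gff3_line(seqid, source, "exon", ex_start, ex_end, strand,
                      {"ID": f"exon:{name}:{n}", "Parent": trna_id})
        )
    return lines

if __name__ == "__main__":
    print("##gff-version 3")
    for line in trna_records("chr1", "example", "trnA1", 100, 172, "+",
                             [(100, 172)]):
        print(line)
```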

I have modified the code locally to report generic features that
have sub-features which must be extracted.  In addition, the ID/Parent
fields are explicitly filled in, and I wonder if we want to do a
better job of ensuring these are meaningfully populated?
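
One check an audit could start with is whether every Parent value
refers to an ID that is actually defined in the same output.  The
sketch below is plain Python under that assumption; parse_attributes
and dangling_parents are hypothetical helper names, and the check is
deliberately narrow (it does not validate types against the Sequence
Ontology, for instance).

```python
from urllib.parse import unquote

def parse_attributes(column9):
    """Split a GFF3 attribute column into {tag: [values]}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            # A comma separates multiple values, e.g. several Parents.
            attrs[tag] = [unquote(v) for v in value.split(",")]
    return attrs

def dangling_parents(gff3_lines):
    """Return (line_number, parent) pairs whose parent ID is never defined."""
    defined_ids = set()
    references = []
    for number, line in enumerate(gff3_lines, 1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.rstrip("\n").split("\t")
        attrs = parse_attributes(columns[8])
        defined_ids.update(attrs.get("ID", []))
        references.extend((number, p) for p in attrs.get("Parent", []))
    # Resolve only after reading everything, so that a child listed
    # before its parent is not reported as an error.
    return [(number, p) for number, p in references if p not in defined_ids]
```

An empty result only says the links resolve; whether the IDs are
meaningful is still a judgment call for whoever reviews the parser.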

So if there are interested people out there, we can try to hammer out
a to-do list on the wiki, see whether we're generating proper GFF3 in
the first place, and try to make sure all the features that get fed
out to Bio::FeatureIO or Bio::Tools::GFF can be properly transformed
into GFF3 and GTF output.

Comments/Volunteers?

-jason

--
Jason Stajich
jason at bioperl.org



