[BioSQL-l] BioSQL and ontology "standards".

Richard Holland holland at eaglegenomics.com
Fri Nov 28 19:16:34 UTC 2008


BioJava does what BioPerl does and pretty much makes it up as it goes
along, using whatever the input files tell it.

The trouble with throwing exceptions when things don't meet standards is
that people complain when their custom files don't work, and can't be
made to work without editing the file itself. By custom I mean not only
things they've written themselves, but also files coming from
established tools which don't follow the rules (NEXUS format is a
classic example of this - the most popular tools that output NEXUS
pretty much ignore the format specification). Even the standards
providers themselves often don't comply with their own rules (several
GenBank examples supplied from NCBI/Entrez break any parser which tries
to be completely strict with the declared format).
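
As a rough sketch of that trade-off (this is not BioJava or BioPerl
code; the function name, the KNOWN_KEYS set and the strict flag are all
made up for illustration), a lenient parser can keep whatever the file
says and merely report the keys it doesn't recognise, while a strict one
stops at the first surprise:

```python
# Hypothetical whitelist of qualifier keys; a real one would come from
# whichever format specification the parser claims to follow.
KNOWN_KEYS = {"locus_tag", "gene", "product"}


def parse_qualifiers(lines, strict=False):
    """Parse GenBank-style '/key="value"' qualifier lines.

    Lenient mode (the default) keeps every key it sees and returns the
    unrecognised ones separately; strict mode raises on the first one.
    """
    qualifiers = {}
    unknown = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, _, value = line.lstrip("/").partition("=")
        value = value.strip('"')
        if key not in KNOWN_KEYS:
            if strict:
                raise ValueError(f"unknown qualifier key: {key!r}")
            unknown.append(key)
        # Trust the file: store the value under the key exactly as written.
        qualifiers.setdefault(key, []).append(value)
    return qualifiers, unknown
```

With strict=False a misspelt key such as "locus_tga" is stored and
listed in the second return value, so the user's file still loads; with
strict=True the same file cannot be read at all without editing it.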

I think the best approach is always to use what the file says, and
trust that it's accurate. What needs to be agreed between projects is
any additional annotations that get introduced outside the context of
file parsing, and the names of the ontologies used for the file
annotations so that all projects use the same ontologies and don't
replicate them inside the BioSQL database. It would be nice to
standardise these names and the additional custom terms across the
projects, in much the same way as people have already tried to
standardise the way general objects get mapped to BioSQL.
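
To make the idea concrete, here is a minimal sketch using Python's
sqlite3 module. The two tables are a cut-down stand-in for the BioSQL
ontology and term tables (the real schema has more columns), and the
helper names and the allow_new flag are assumptions of mine, not any
project's API. The only thing taken from this thread is the ontology
name 'Annotation Tags' that Peter mentions below:

```python
import sqlite3

# Simplified stand-in for the BioSQL ontology/term tables (assumption:
# only the columns needed for this example are included).
SCHEMA = """
CREATE TABLE ontology (
    ontology_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE term (
    term_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    ontology_id INTEGER NOT NULL REFERENCES ontology(ontology_id),
    UNIQUE (name, ontology_id)
);
"""

# One agreed name, shared by every project writing to the database.
ANNOTATION_TAGS = "Annotation Tags"


def get_or_create_ontology(conn, name):
    """Return the id of the named ontology, creating it if missing."""
    row = conn.execute(
        "SELECT ontology_id FROM ontology WHERE name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]
    return conn.execute(
        "INSERT INTO ontology (name) VALUES (?)", (name,)
    ).lastrowid


def get_or_create_term(conn, term_name, ontology_name, allow_new=True):
    """Look up a term in an ontology, adding it on the fly if allowed."""
    ontology_id = get_or_create_ontology(conn, ontology_name)
    row = conn.execute(
        "SELECT term_id FROM term WHERE name = ? AND ontology_id = ?",
        (term_name, ontology_id),
    ).fetchone()
    if row:
        return row[0]
    if not allow_new:
        raise KeyError(f"{term_name!r} is not in ontology {ontology_name!r}")
    return conn.execute(
        "INSERT INTO term (name, ontology_id) VALUES (?, ?)",
        (term_name, ontology_id),
    ).lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

If two loaders both call get_or_create_term(conn, "locus_tag",
ANNOTATION_TAGS), they end up sharing one ontology row and one term row.
If they disagree on the ontology name, the same term gets replicated
under two ontologies, which is exactly what agreeing the names would
avoid.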

cheers,
Richard

Peter wrote:
> Hi all,
> 
> The BioSQL schema allows multiple ontologies, so that things like
> entries in seqfeature_qualifier_value can say what they mean by
> "locus_tag".
> 
> Currently BioPerl and Biopython (and I assume the other projects but
> haven't checked) use a couple of ad-hoc ontology names for storing
> annotation.  In particular, if there is no predefined entry for a
> novel ontology term, it gets added on the fly.  This is very
> convenient as it means a BioSQL database can be used without first
> importing a predefined ontology.  However there are downsides, for
> example spelling errors in the keys of a GenBank file get treated as
> ontology entries.
> 
> Have these ad-hoc ontologies ever been defined?  i.e. For table
> bioentry_qualifier_value terms, which ad-hoc ontology name should be
> used?  Biopython uses ad-hoc ontologies named 'SeqFeature Keys',
> 'SeqFeature Sources', 'Annotation Tags' for various different tables
> (which I believe is the same for BioPerl).
> 
> On a related point, it might make more sense to use a predefined
> ontology, like SOFA or SO from http://www.sequenceontology.org/ where
> a novel term is treated as an error (or perhaps falls back on the
> ad-hoc ontology).  How do the various Bio* projects cope with
> annotations in the database for different or multiple ontologies?  Or
> has this not been considered?
> 
> Thanks,
> 
> Peter

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

