[BioSQL-l] BioSQL and ontology "standards".

Peter biopython at maubp.freeserve.co.uk
Fri Nov 28 15:04:33 EST 2008


On Fri, Nov 28, 2008 at 7:16 PM, Richard Holland wrote:
>
> BioJava does what BioPerl does and pretty much makes it up as it goes
> along, using whatever the input files tell it.

OK, good.  But which ontology names do you use for which tables?  i.e.
Do you also use ad-hoc ontologies named  'SeqFeature Keys',
'SeqFeature Sources' and 'Annotation Tags'?

To be a little more specific, here are some examples - which I presume
(hope) are all copying BioPerl's conventions.

In recording a bioentry date, Biopython sets
bioentry_qualifier_value.term_id to point to a term table entry
"date_changed" which belongs to the ad-hoc "Annotation Tags" ontology.

In recording most bioentry annotations (e.g. a list of keywords),
Biopython sets bioentry_qualifier_value.term_id to point to a term
table entry for that annotation type (e.g. "keywords") which belongs
to the ad-hoc "Annotation Tags" ontology.
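
To illustrate the "extend on the fly" behaviour described above, here
is a minimal sketch of a get-or-create lookup.  It is not Biopython's
actual loader code: the two tables are cut-down stand-ins for the
BioSQL ontology and term tables, and the helper names get_ontology_id
and get_term_id are used here for illustration only.

```python
import sqlite3

# Assumption: simplified stand-ins for the BioSQL ontology and term tables,
# keeping only the columns this sketch needs.
SCHEMA = """
CREATE TABLE ontology (
    ontology_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE term (
    term_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    ontology_id INTEGER NOT NULL REFERENCES ontology(ontology_id),
    UNIQUE (name, ontology_id)
);
"""


def get_ontology_id(conn, name):
    """Return the id of the named ontology, creating it if missing."""
    row = conn.execute(
        "SELECT ontology_id FROM ontology WHERE name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]
    return conn.execute(
        "INSERT INTO ontology (name) VALUES (?)", (name,)
    ).lastrowid


def get_term_id(conn, name, ontology_name):
    """Return the id of a term within an ontology, creating it if missing."""
    ontology_id = get_ontology_id(conn, ontology_name)
    row = conn.execute(
        "SELECT term_id FROM term WHERE name = ? AND ontology_id = ?",
        (name, ontology_id),
    ).fetchone()
    if row:
        return row[0]
    return conn.execute(
        "INSERT INTO term (name, ontology_id) VALUES (?, ?)",
        (name, ontology_id),
    ).lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Both annotation types land in the same ad-hoc ontology.
date_id = get_term_id(conn, "date_changed", "Annotation Tags")
keywords_id = get_term_id(conn, "keywords", "Annotation Tags")
```

The returned ids are what would go into
bioentry_qualifier_value.term_id.  Asking for the same term twice
gives back the same id rather than creating a duplicate.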

In recording a seqfeature, Biopython sets seqfeature.type_term_id
to point to a term table entry for that feature type (e.g. "CDS",
"misc_feature", "gene") which belongs to the ad-hoc "SeqFeature Keys"
ontology.  Biopython always sets seqfeature.source_term_id to point to
a term table entry for "EMBL/GenBank/SwissProt" within the ad-hoc
"SeqFeature Sources" ontology.

In recording most of a seqfeature's qualifiers (annotations),
Biopython sets seqfeature_qualifier_value.term_id to point to a term
table entry for the key (e.g. "locus_tag", "note", "translation")
which belongs to the ad-hoc "Annotation Tags" ontology.

Notice that the ad-hoc "Annotation Tags" ontology serves double duty,
doing both bioentry and seqfeature annotations.  This doesn't seem
entirely sensible.
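
The usage described so far can be written down as a small lookup
table, which makes the double duty easy to spot.  This is only a
summary of the examples in this email; the role labels are my own
wording, not names from the schema.

```python
from collections import defaultdict

# Role of the term (descriptive label, not a schema name) mapped to the
# ad-hoc ontology it is stored under, per the examples in this email.
USAGE = {
    "bioentry annotation tag": "Annotation Tags",
    "seqfeature key": "SeqFeature Keys",
    "seqfeature source": "SeqFeature Sources",
    "seqfeature qualifier key": "Annotation Tags",
}


def roles_by_ontology(usage):
    """Group the roles by the ontology that stores their terms."""
    grouped = defaultdict(list)
    for role, ontology in usage.items():
        grouped[ontology].append(role)
    return dict(grouped)


# Ontologies that serve more than one role.
shared = {
    ontology: roles
    for ontology, roles in roles_by_ontology(USAGE).items()
    if len(roles) > 1
}
```

Only "Annotation Tags" ends up in shared, covering both the bioentry
and the seqfeature qualifier roles.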

On the other hand, when recording a seqfeature's location Biopython
and BioPerl leave location.term_id as NULL (rather than using any
particular ontology term).  This seems arbitrary.

Relating to this, if we want to record a composite location type
(typically "join"), we'd want to use the location_qualifier_value
table.  BioPerl seems to leave this table empty (presumably assuming
all composite locations are joins) which is what Biopython currently
does too.  Here we can't just set location_qualifier_value.term_id to
NULL (why not?) so we have to introduce something.  The BioSQL
projects should first agree which ontology term to use, and which
ontology it should belong to.
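
As a rough sketch of the problem, assuming term_id is declared NOT
NULL and forms part of the primary key (as I understand the schema):
the tables below are cut-down stand-ins, and the term name
"location_operator" and ontology name "Location Tags" are purely
hypothetical placeholders, not an agreed convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: simplified tables; the ontology is reduced to a text column.
conn.executescript("""
CREATE TABLE term (
    term_id  INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    ontology TEXT NOT NULL
);
CREATE TABLE location_qualifier_value (
    location_id INTEGER NOT NULL,
    term_id     INTEGER NOT NULL REFERENCES term(term_id),
    value       TEXT NOT NULL,
    PRIMARY KEY (location_id, term_id)
);
""")


def record_location_operator(conn, location_id, term_id, operator):
    """Store a composite-location operator such as "join" for a location."""
    conn.execute(
        "INSERT INTO location_qualifier_value (location_id, term_id, value) "
        "VALUES (?, ?, ?)",
        (location_id, term_id, operator),
    )


# A NULL term_id is rejected by the NOT NULL constraint.
try:
    record_location_operator(conn, 1, None, "join")
    null_accepted = True
except sqlite3.IntegrityError:
    null_accepted = False

# Hypothetical term and ontology names, pending agreement between projects.
op_term_id = conn.execute(
    "INSERT INTO term (name, ontology) VALUES (?, ?)",
    ("location_operator", "Location Tags"),
).lastrowid
record_location_operator(conn, 1, op_term_id, "join")
```

Whatever names are chosen, every project would need to use the same
ones for the stored operator to be readable across toolkits.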

> The trouble with throwing exceptions when things don't meet standards is
> that people complain when their custom files don't work, and can't be
> made to work without editing the file itself. ...

I'm not sure if you are talking about parsing files, or loading them
into BioSQL.  I agree that when parsing sometimes some leeway is
required.

In terms of *optionally* enforcing a strict ontology, throwing an
error is a good thing if the input file doesn't follow the ontology -
this indicates a problem with the file (or perhaps an out of date
ontology).  I would certainly leave the default behaviour as is with
the ad-hoc ontologies extended on the fly.
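
Something along these lines is what I have in mind.  This is a toy
in-memory sketch rather than a proposal for an actual API; the
Ontology class, its require method, and UnknownTermError are all
made-up names.

```python
class UnknownTermError(ValueError):
    """Raised in strict mode when a term is not in the ontology."""


class Ontology:
    """Toy ontology: a named set of terms with optional strict checking."""

    def __init__(self, name, terms=(), strict=False):
        self.name = name
        self.terms = set(terms)
        self.strict = strict

    def require(self, term):
        """Return the term, adding it on the fly unless strict is set."""
        if term not in self.terms:
            if self.strict:
                raise UnknownTermError(
                    f"{term!r} is not in ontology {self.name!r}"
                )
            self.terms.add(term)
        return term


# Default behaviour: the ad-hoc ontology grows as new terms are seen.
adhoc = Ontology("SeqFeature Keys", terms={"CDS", "gene"})
adhoc.require("my_custom_feature")

# Optional strict behaviour: an unknown term is reported as an error.
strict = Ontology("SeqFeature Keys", terms={"CDS", "gene"}, strict=True)
try:
    strict.require("my_custom_feature")
    rejected = False
except UnknownTermError:
    rejected = True
```

With strict left off, nothing changes for people loading custom
files; turning it on is an explicit choice by the user.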

> I think the best approach is to always to use what the file says, and
> trust that it's accurate. What needs to be agreed between projects is
> any additional annotations that get introduced outside the context of
> file parsing, and the names of the ontologies used for the file
> annotations so that all projects use the same ontologies and don't
> replicate them inside the BioSQL database. It would be nice to
> standardise these names and the additional custom terms across the
> projects, in much the same way as people tried already to standardise
> the way general objects get mapped to BioSQL.

This is what I am trying to get at here - documenting the existing
"ad-hoc" ontology usage.  My impression is that it has not been
documented, and that the BioPerl behaviour is the de facto BioSQL
standard.

I'd like to pin down this standard, and extend it for situations like
the location_qualifier_value.term_id and perhaps location.term_id
where BioPerl seems to ignore the ontology issue.

Peter

