Identifiers (was: [Biocorba-l] BSANE and bioCORBA)

Fri, 1 Jun 2001 21:19:50 +0100 (GMT Daylight Time)

On Fri, 1 Jun 2001, Andrew Dalke wrote:

> Databases can have multiple keys, and even multiple unique keys.  
> This can occur when the database augments another database, or when
> the naming scheme is changing and both the old and new systems are
> preserved.

So, is a mechanism that allows the name of the identifier field to be
specified necessary? Before conceding, I'll contend:

- This puts an extra onus on the user to know the name of this field (e.g.
is it 'ac', 'acc', 'accession' or 'AC'?); so I'd like there to be a
default mechanism where the database assumes it's being handed its concept
of a primary identifier (if you've got that wrong, then woe betide you).

- Is this perhaps confusing the concept of a primary identifier with the
name used for that identifier in the flatfile entry?

- I'll see if I get shot down on this, but having two (or more) sets of
identifiers that have the same names for different entries sounds like a
design flaw (though it can happen... especially with numeric values).

i.e. 

  PID1  | PID2  |
  a1    | a1    |    <-- Ug!
  a2    | b1    |
  a3    | c1    |

I believe there is only one case in the 12+ million EMBL records where the
ID of one entry is the same as the AC of another entry. Even so, and for
good reason, we treat AC as the highest-priority identifier (see below).

Much better to "unique-ify" the name of each PID: i.e. PID1_a1 and
PID2_a1. In this case, I can ask for any PID and know I'll get back the
right entry, without specifying the name of the field required.
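
To illustrate the "unique-ify" idea, here is a minimal Python sketch. The
scheme names, entry labels and the build_index helper are all made up for
the example; it only assumes that each scheme is unique within itself.

```python
# Hypothetical data: two identifier schemes that both contain the value "a1",
# but for different entries (the clash described above).
pid1_entries = {"a1": "entry-A", "a2": "entry-B", "a3": "entry-C"}
pid2_entries = {"a1": "entry-B", "b1": "entry-A", "c1": "entry-C"}


def build_index(schemes):
    """Merge several identifier schemes into one index keyed by '<scheme>_<id>'."""
    index = {}
    for scheme, entries in schemes.items():
        for ident, entry in entries.items():
            key = f"{scheme}_{ident}"
            if key in index:
                # Would only happen if one scheme repeated an identifier.
                raise ValueError(f"duplicate identifier: {key}")
            index[key] = entry
    return index


index = build_index({"PID1": pid1_entries, "PID2": pid2_entries})
```

With the prefix in place, index["PID1_a1"] and index["PID2_a1"] resolve to
different entries, and no field name has to be passed alongside the PID.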

> For example, should I use the ID, AC or SV field in an EMBL record?  
> Or the locus, accession, version or gi fields of genbank,

For the record, the first SV/VERSION [syntax 'accession.version'] is the
primary identifier that should be used. The accession is unique to an
entry and stable across database versions (unlike ID/locus); the version
increments by one each time the sequence entry is changed. (If others are
present, they are secondary identifiers.)
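
As a small sketch of the 'accession.version' syntax, the following Python
function splits such a string into its two parts. The function name
split_sv is hypothetical, and treating a missing version as None is an
assumption made for the example, not something the flatfile format says.

```python
def split_sv(sv):
    """Split 'accession.version' into (accession, version).

    The version is returned as an int, or None when no '.version' suffix
    is present (an assumption of this sketch).
    """
    accession, sep, version = sv.partition(".")
    if not accession:
        raise ValueError(f"missing accession in {sv!r}")
    if not sep:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"version is not a number in {sv!r}")
    return accession, int(version)
```

For example, split_sv("AL121903.13") gives ("AL121903", 13), while
split_sv("AL121903") gives ("AL121903", None).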

ID/locus is not guaranteed to be stable over database versions, so beware!
It's there primarily because people want human-understandable identifiers
that are loaded with semantics (and if the semantics change, the name
changes!).

IMO, 'gi' numbers are evil: unlike locus/ID, accessions and versions,
they are not part of the nucleotide collaborative data exchange agreement
between NCBI, EBI and DDBJ.

> So I would prefer a bit more explicitness in the name to reflect which
> field is used.  To use Scott's example, something like
> 
>   x-sequence-na:gb.123:accession:AL121903.13

That should be

  x-sequence-na:gb.123:accession:AL121903

which I'd interpret as pointing to the most recent version.

Compare that with how I would interpret:

  x-sequence-na:gb:accession:AL121903

>   x-sequence-na:gb.123:version:AL121903.13

I'd say this should be:

  x-sequence-na:gb.123:accession:AL121903.13

And I'd interpret it as pointing to version 13 of the entry with accession
AL121903. (This assumes semantics and syntax of 'name.version').
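
The interpretation above could be sketched in Python roughly as follows.
The names SeqRef and parse_ref are hypothetical, reading the '.123' after
the database name as a release number is an assumption, and a missing
field name is taken to mean "the database's primary identifier" as argued
earlier. A missing version is returned as None, to be read as "most
recent".

```python
from typing import NamedTuple, Optional


class SeqRef(NamedTuple):
    namespace: str
    database: str
    release: Optional[str]   # assumed meaning of the part after 'gb.'
    field: Optional[str]     # None = the database's primary identifier
    name: str
    version: Optional[int]   # None = most recent version


def parse_ref(ref):
    """Parse 'namespace:db[.release]:[field:]name[.version]'."""
    parts = ref.split(":")
    if len(parts) == 4:
        namespace, db, field, ident = parts
    elif len(parts) == 3:
        namespace, db, ident = parts
        field = None
    else:
        raise ValueError(f"cannot parse identifier: {ref!r}")
    database, _, release = db.partition(".")
    name, _, version = ident.partition(".")
    return SeqRef(namespace, database, release or None, field, name,
                  int(version) if version else None)
```

So parse_ref("x-sequence-na:gb.123:accession:AL121903.13") yields version
13 of AL121903, and dropping the '.13' yields version None.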

N.B. According to this naming system, in EMBL you would specify:

  x-sequence-na:embl.99:AC:AL121903.13

Rather than:

  x-sequence-na:gb.123:AL121903.13 & x-sequence-na:embl.99:AL121903.13

or even

  x-sequence-na:gb:AL121903 & x-sequence-na:embl:AL121903

Of course, should you decide to use gi's (Grrrr!), then 

  x-sequence-na:gb:31224 is valid and unique
but
  x-sequence-na:embl:31224 doesn't exist.

> The other concern is if the data base changes its naming scheme.  
> Suppose there is an ID and a NID (for "New ID").  For backwards
> compatibility with old references you will want to access the database
> like
> 
>    x-sequence-na:database:ID:QWERTY
> 
> but let new references handle new entries like
> 
>    x-sequence-na:database:NID:QWERTY
> 
> Without the extra description of which identifier to use, it would be
> impossible to allow this.

If you re-use old identifiers for different entries in your new database,
then you're just asking for trouble. Don't go there!

It's a tough decision: support old identifiers, or throw them all away as
being flawed and force users to migrate.

Again, though, are you mixing up the concept of a primary identifier with
the name of the PID field in an entry? If I am supporting old identifiers
then, so long as they don't clash with the new identifiers, I can still
ask for an old one as the primary identifier and get back the correct
entry. If it's been retired, I get a 'does not exist' exception (and annoy
& confuse my users).
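
A minimal Python sketch of that behaviour, assuming old identifiers never
clash with new ones. The EntryNotFound exception, the fetch function and
the sample identifiers are all invented for the example.

```python
class EntryNotFound(KeyError):
    """Raised when an identifier is unknown or has been retired."""


def fetch(identifier, current, legacy):
    """Return the entry for a primary identifier, old or new.

    current maps new identifiers to entries; legacy maps still-supported
    old identifiers to their new identifier. No field name is needed.
    """
    new_id = identifier if identifier in current else legacy.get(identifier)
    if new_id is None or new_id not in current:
        raise EntryNotFound(identifier)
    return current[new_id]


# Hypothetical data: one entry reachable by both its new and its old name.
current = {"NEW0001": "sequence entry 1"}
legacy = {"QWERTY": "NEW0001"}
```

Here fetch("QWERTY", current, legacy) and fetch("NEW0001", current, legacy)
return the same entry, while a retired name raises EntryNotFound.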

> BTW, some of these, like accessions, can occasionally match multiple
> records.  I'm not sure how to handle when, say, three records match -
> either ignore it, return one arbitrarily, if the format allows, return
> them all at once, or return something like a multi-part record or
> iterator over the return.

Whether an identifier matches more than one entry shouldn't be an issue
for the identifier itself. I'd like to think, though, that identifiers are
defined as being unique (otherwise they're not true identifiers), in which
case an exception should be thrown if a method tries to return more than
one entry.
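
One way to sketch that rule in Python is shown below. The exception name
AmbiguousIdentifier, the fetch_unique function and the record layout
(plain dicts with an 'accession' key) are assumptions for the example.

```python
class AmbiguousIdentifier(LookupError):
    """Raised when a supposedly unique identifier matches several entries."""


def fetch_unique(identifier, records, field="accession"):
    """Return the single record whose field equals identifier."""
    matches = [rec for rec in records if rec.get(field) == identifier]
    if not matches:
        raise LookupError(f"{identifier} does not exist")
    if len(matches) > 1:
        raise AmbiguousIdentifier(
            f"{identifier} matches {len(matches)} entries")
    return matches[0]
```

The caller then has to deal with the ambiguity explicitly, rather than
silently receiving an arbitrary one of the matching records.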

Regards,
Alan.

--
============================================================
Alan J. Robinson, D.Phil.             Tel:+44-(0)1223 494444
European Bioinformatics Institute     Fax:+44-(0)1223 494468
EMBL Outstation - Hinxton             Email:  alan@ebi.ac.uk
Wellcome Trust Genome Campus
Hinxton, Cambridge
CB10 1SD, UK                http://industry.ebi.ac.uk/~alan/
============================================================