Identifiers (was: [Biocorba-l] BSANE and bioCORBA)

Andrew Dalke dalke@acm.org
Fri, 1 Jun 2001 11:59:38 -0600


Just scanning through the emails on this topic and saw
Scott's examples including

  x-sequence-na:gb.123:AL121903.13

Databases can have multiple keys, and even multiple
unique keys.  This can occur when the database augments
another database, or when the naming scheme is changing
and both the old and new systems are preserved.

There are two naming concerns I see related to that.

When I first started in bioinformatics I had a hard
time knowing which of the different fields to use.
(What am I saying - I still do! :)  With the increase
in the number of databases and database numbering
schemes, it's still hard to keep track of which
number to use in which context.  It isn't obvious
which is the primary one.

For example, should I use the ID, AC or SV field
in an EMBL record?  Or the locus, accession, version
or gi fields of GenBank?

So I would prefer a bit more explicitness in the
name to reflect which field is used.  To use Scott's
example, something like

  x-sequence-na:gb.123:accession:AL121903.13
(or "ac" or "acc" if you want a shorter name.)

or
  x-sequence-na:gb.123:version:AL121903.13

I'm not sure enough about the distinction between
the "version" and "accession" fields to fully understand
when to use one over the other.  And that's part of
my point - by not naming something at the beginning,
you force others to know the right context in which
to use a name.
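
To make the idea concrete, here is a rough sketch in Python of
splitting such a name into its parts.  The four-part layout
(scheme:database:field:value) is only my reading of the examples
above, and the names SeqIdentifier and parse_identifier are made
up for illustration; nothing here is an agreed format.

```python
from typing import NamedTuple


class SeqIdentifier(NamedTuple):
    """Hypothetical four-part identifier; layout assumed from the examples."""
    scheme: str    # e.g. "x-sequence-na"
    database: str  # e.g. "gb.123"
    field: str     # e.g. "accession" or "version"
    value: str     # e.g. "AL121903.13"


def parse_identifier(text: str) -> SeqIdentifier:
    # Split on the first three colons only, so the value itself may
    # contain a colon without being cut apart.
    parts = text.split(":", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            f"expected scheme:database:field:value, got {text!r}")
    return SeqIdentifier(*parts)
```

With this, the three-part form x-sequence-na:gb.123:AL121903.13 is
rejected rather than guessed at, which is the point: the reader of
the name never has to know which field was meant.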


The other concern is what happens when the database changes
its naming scheme.  Suppose there is an ID and a NID (for
"New ID").  For backwards compatibility with old
references you will want to access the database like

   x-sequence-na:database:ID:QWERTY

but let new references handle new entries like

   x-sequence-na:database:NID:QWERTY

Without the field name saying which identifier is meant,
the two QWERTYs could not be told apart.
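
A small sketch of what the server side might look like, assuming
the database keeps one index per identifier field.  The INDEXES
table, its contents and the lookup function are invented for the
example.

```python
# Hypothetical in-memory database: one index per identifier field.
# The same string "QWERTY" is a valid key in both naming schemes.
INDEXES = {
    "ID": {"QWERTY": "record found via the old ID scheme"},
    "NID": {"QWERTY": "record found via the new NID scheme"},
}


def lookup(field: str, value: str) -> str:
    # The field name from the identifier selects which index to search.
    try:
        index = INDEXES[field]
    except KeyError:
        raise ValueError(f"unknown identifier field {field!r}") from None
    return index[value]
```

Here lookup("ID", "QWERTY") and lookup("NID", "QWERTY") return
different records even though the key string is identical.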

BTW, some of these, like accessions, can occasionally match
multiple records.  I'm not sure how to handle the case when, say,
three records match.  The options I see are to ignore it, return
one arbitrarily, return them all at once if the format allows, or
return something like a multi-part record or an iterator over
the matches.

This is actually related to search methods and is more
complex than I think is appropriate for now.  But I've
seen people use accessions as primary-like keys, so if there
is a need for a swiss:ac:QWERTY then this case has to be
addressed.
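
Just to show two of the possible behaviours for a key that matches
several records, here is a sketch with an iterator over the matches
and a strict variant that refuses to pick one.  The RECORDS data and
both function names are made up, and this is not a proposal for the
actual interface.

```python
from typing import Iterator

# Hypothetical records; two of them share the accession "QWERTY".
RECORDS = [
    {"id": "REC1", "ac": ["QWERTY"]},
    {"id": "REC2", "ac": ["QWERTY", "ASDFGH"]},
    {"id": "REC3", "ac": ["ZXCVBN"]},
]


def iter_by_accession(accession: str) -> Iterator[dict]:
    # Yield every record that lists the accession, in storage order.
    for record in RECORDS:
        if accession in record["ac"]:
            yield record


def fetch_unique(accession: str) -> dict:
    # Strict variant: fail loudly unless exactly one record matches.
    matches = list(iter_by_accession(accession))
    if len(matches) != 1:
        raise LookupError(
            f"{len(matches)} records match accession {accession!r}")
    return matches[0]
```

A caller that treats the accession as a primary key would use
fetch_unique and find out about the ambiguity right away; a caller
that expects several hits would loop over iter_by_accession.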

                    Andrew
                    dalke@acm.org