[DAS2] on local and global ids
Andrew Dalke
dalke at dalkescientific.com
Wed Mar 15 16:25:53 EST 2006
The discussion today was on local segment identifiers vs. global
segment identifiers.
I'm going to characterize them as "abstract" vs. "concrete"
identifiers. An abstract id has no default resolution to a
resource. A concrete one does.
The identifier "http://www.biodas.org/" is a concrete identifier
because it has a default resolver. "lsid:ncbi:human:35" is an
abstract identifier because it has no default resolver (there
are resolvers for lsid, but they are not default resolvers).
The global segment identifier may be a concrete identifier. It
may implement the segments interface. But who is in charge of
that? Who defines and maintains the service? If it goes down,
(power outage, network cable cut) then what does the rest of
the world do?
For the purposes of DAS it is better (IMO) that the global
identifiers be abstract, though they should be http URLs which
are resolvable to something human readable. (This is what
the XML namespace elements do.)
Reference servers are concrete identifiers. They exist. They
can change (eg, change technologies and change the URLs, say
from cgi-bin/*.pl to an in-process servlet.) Now, they should
be long-lived, but that's not how life works.
Suppose someone wants to set up an annotation server, without
setting up a reference server. One solution is to point to
an existing reference server.
  <SOURCES>
    <SOURCE>
      <VERSION>
        <CAPABILITY type="segments"
                    uri="http://some/remote/reference/server" />
        <CAPABILITY type="features" uri="features.cgi" />
        <CAPABILITY type="types" uri="types.xml" />
      </VERSION>
    </SOURCE>
  </SOURCES>
In this case all the features are returned with segments labeled
as in the reference server. There's no problem.
Second, Andreas wants an abstract "COORDINATE" space id
  <SOURCES>
    <SOURCE>
      <VERSION>
        <COORDINATES uri="http://some/arbitrary/coordinate/id"
                     authority="NCBI"
                     version="35" .... />
        <CAPABILITY type="features" uri="features.cgi" />
        <CAPABILITY type="types" uri="types.xml" />
      </VERSION>
    </SOURCE>
  </SOURCES>
This requires a more complicated client because it must have other
information to figure out how to convert from the coordinate identifier
into the corresponding types.
The answer that Andreas and others give is "consult the registry".
That is, look for other segments CAPABILITY elements with
the same coordinates id. For that to happen there needs to be a
way to associate a segments doc with a coordinate system. For example,
this is what the current spec allows (almost - there's no example
of it and I'm still trying to get the schema working for it)
  <SOURCES>
    <SOURCE>
      <VERSION>
        <COORDINATES uri="http://some/arbitrary/coordinate/id"
                     authority="NCBI"
                     version="35" .... />
        <CAPABILITY type="segments" uri="features.cgi"
                    coordinates="http://some/arbitrary/coordinate/id" />
      </VERSION>
    </SOURCE>
  </SOURCES>
This makes a resolution scheme from an abstract coordinate identifier
into a concrete segments document identifier.
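As a rough sketch of that resolution step, assuming the sources
document has already been fetched: scan it for segments CAPABILITY
elements whose coordinates attribute matches the abstract id. The
document below is hypothetical (the "segments.cgi" URI and the
function name are made up for illustration, and the elided
COORDINATES attributes are simply left out).

```python
import xml.etree.ElementTree as ET

# Hypothetical sources document modeled on the example above.
# "segments.cgi" is an assumed URI, not something from the spec.
SOURCES_XML = """
<SOURCES>
  <SOURCE>
    <VERSION>
      <COORDINATES uri="http://some/arbitrary/coordinate/id"
                   authority="NCBI" version="35" />
      <CAPABILITY type="segments" uri="segments.cgi"
                  coordinates="http://some/arbitrary/coordinate/id" />
      <CAPABILITY type="features" uri="features.cgi" />
    </VERSION>
  </SOURCE>
</SOURCES>
"""

def find_segments_uris(sources_xml, coordinates_id):
    """Return the uri of every segments CAPABILITY tied to coordinates_id."""
    root = ET.fromstring(sources_xml)
    return [
        cap.get("uri")
        for cap in root.iter("CAPABILITY")
        if cap.get("type") == "segments"
        and cap.get("coordinates") == coordinates_id
    ]

print(find_segments_uris(SOURCES_XML, "http://some/arbitrary/coordinate/id"))
```

A real client would also have to resolve the returned URI against
the document's base and decide what to do when several servers
claim the same coordinates id.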
Why are there so many fields on the coordinates? It could be
normalized, so you fetch the coordinate id to get the information.
It's there to support searches. A goal has been that the top-level
sources document gives you everything you need to know about the
system.
(Doesn't mean it's elegant. I won't talk about alternatives. It's
not important. There's at most an extra 150 or so bytes per versioned
source.)
The problem comes when a site wants a local reference server.
These segments have concrete local names.
DAS1 experience suggests that people almost always set up local
servers. They do not refer to a well-known server.
There are good reasons for doing this. If the local annotation
server works then the local reference server is almost certain
to work. The well-known server might not work.
Also, the configuration data is in the sources document. There's
no need to set up a registry server to resolve coordinates. There's
no configuration needed in the client to point to the appropriate
concrete identifier given an abstract URL.
My own experience has been that people do not read specifications.
I am an odd-ball. According to
http://diveintomark.org/archives/2004/08/16/specs
I am an asshole. That's okay -- most people are morons.
> Morons, on the other hand, don’t read specs until someone yells at
> them. Instead, they take a few examples that they find “in the wild”
> and write code that seems to work based on their limited sample. Soon
> after they ship, they inevitably get yelled at because their product
> is nowhere near conforming to the part of the spec that someone else
> happens to be using. Someone points them to the sentence in the spec
> that clearly spells out how horribly broken their software is, and
> they fix it.
Someone who wants to implement a DAS reference server will
take the data from somewhere and make up a local naming scheme.
That's what happened with DAS1. That's why Gregg was saying
he maintains a synonym table saying human
1 = chr1 = Chromo1 = ChrI
2 = chr2 = Chromo2 = ChrII
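A synonym table like that can be as small as a dictionary. This is
only a sketch of the idea, not Gregg's actual code; picking the
first name in each row as the canonical one is an arbitrary
assumption, and the function name is made up.

```python
# Each row lists names that different servers use for the same segment.
SYNONYM_ROWS = [
    ["1", "chr1", "Chromo1", "ChrI"],
    ["2", "chr2", "Chromo2", "ChrII"],
]

# Map every spelling to the first name in its row (arbitrary choice).
CANONICAL = {name: row[0] for row in SYNONYM_ROWS for name in row}

def canonical_name(name):
    """Return the canonical segment name, or the name unchanged if unknown."""
    return CANONICAL.get(name, name)

print(canonical_name("ChrII"))
```

The catch is that every client has to maintain such a table by
hand, per organism.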
This will not change. People will write a server for local data
and point a DAS client at it. The client had better just work
for the simple case of viewing the data even though there is
no coordinate system -- it needs to, because people will work on
systems with no coordinate system.
Sites will even write multiple in-house DAS servers providing
data, which work because everything refers to the same in-house
reference server.
It's only the first time that someone wants to merge in-house
data with external data that there's a problem. This might be
several months after setting up the server. At that point they
do NOT want to rewrite all the in-house servers to switch to
a new naming scheme.
That's why the primary key for a paired annotation server and
feature must be a local name. That's what morons will use.
Few will consult some global registry to make things interoperable
at the start.
> For example, some people posit the existence of what I will call the
> “angel” developer. “Angels” read specs closely, write code, and then
> thoroughly test it against the accompanying test suite before shipping
> their product. Angels do not actually exist, but they are a useful
> fiction to make spec writers to feel better about themselves.
Lincoln could come up with universal names for every coordinate
system that ever existed or will exist. But people will not
consult it.
However, they will when there is a need to do that. The need comes
in when they want to import external data. At that point they need
a way to join between two different data sources.
They consult the spec and see that there's a "synonym" attribute
(or "reference", or "global", or "master", or *whatever* name -- I
went with synonym because it doesn't imply that it's the better name):
  <SEGMENT uri="segment/chrI" title="Chromosome I" length="230209"
           synonym="http://dalkescientific.com/yeast1/ChrI" />
The local name <xml:base> + "segment/chrI" is also known as
http://dalkescientific.com/yeast1/ChrI . Simple, and requires
very little change in the server code.
The only other change is to support the synonym name when
doing segment requests, as
segment=http://dalkescientific.com/yeast1/ChrI
This is important because then clients can make range requests
from servers without having to download the segment document first.
It's also easy to implement, because it's a lookup table in the
web server interface, and not something which needs to be in
the database proper.
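To show how little is involved, here is a minimal sketch of that
lookup in the web layer. The table contents, the function name,
and the assumption that a local name is passed as the relative
"segment/chrI" form are all hypothetical.

```python
from urllib.parse import parse_qs

# Hypothetical lookup table kept in the web server interface:
# global synonym -> local segment name.
SYNONYM_TO_LOCAL = {
    "http://dalkescientific.com/yeast1/ChrI": "segment/chrI",
}
LOCAL_NAMES = set(SYNONYM_TO_LOCAL.values())

def resolve_segment(query_string):
    """Return the local segment name for a 'segment=' query, or None."""
    values = parse_qs(query_string).get("segment")
    if not values:
        return None
    requested = values[0]
    if requested in LOCAL_NAMES:
        # Already a local name; pass it through untouched.
        return requested
    # Otherwise try it as a synonym; unknown names give None.
    return SYNONYM_TO_LOCAL.get(requested)

print(resolve_segment("segment=http://dalkescientific.com/yeast1/ChrI"))
```

The database code behind this only ever sees local names.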
Most people are morons. The spec as-is is written for that.
It's not written for angels. It allows post-facto patch-ups once
people realize they need a globally recognized name.
It does require smarter clients. They need to map from local
name to global name, through a translation table provided by
the server. This is fast and easy to implement. It's easier
to implement than consulting multiple registry servers and
trying to figure out which is appropriate.
And the XML returned will be smaller.
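On the client side the translation table falls straight out of the
segments document. A sketch, assuming a SEGMENTS wrapper element
and a made-up base URL and second segment; a segment with no
synonym just keeps its local name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical base URL and segments document for illustration.
BASE = "http://example.org/das2/yeast/"
SEGMENTS_XML = """
<SEGMENTS>
  <SEGMENT uri="segment/chrI" title="Chromosome I" length="230209"
           synonym="http://dalkescientific.com/yeast1/ChrI" />
  <SEGMENT uri="segment/plasmid1" title="In-house plasmid" length="1000" />
</SEGMENTS>
"""

def local_to_global(segments_xml, base):
    """Map each absolute local segment name to its global name."""
    table = {}
    for seg in ET.fromstring(segments_xml).iter("SEGMENT"):
        local = urljoin(base, seg.get("uri"))
        # No synonym means the segment is only known by its local name.
        table[local] = seg.get("synonym", local)
    return table

for local, global_name in local_to_global(SEGMENTS_XML, BASE).items():
    print(local, "->", global_name)
```

Two servers whose tables map to the same global name are talking
about the same segment, and their features can be merged.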
Andrew
dalke at dalkescientific.com