[DAS2] on local and global ids
Andrew Dalke
dalke at dalkescientific.com
Thu Mar 16 10:38:00 EST 2006
Thomas:
> I'm not sure that DAS1 experience is a good model for this. It's true
> that people didn't always point to well-known reference servers, but I
> think this has more to do with the fact that people didn't know which
> server to point to.
I think I said there are two cases; there are actually several:
1. the sources document states a well-known COORDINATES
and makes no links to segments
2. the sources document refers to a well-known segments server
("the" reference server) and no COORDINATES
3. the sources document has a segments document, and each segment
listed uses URIs from "the" reference server
4. the server implements its own coordinates server, with
new segment ids
5. When uploading a track to Ensembl there's no need to have
either COORDINATES or segments -- the upload server can
verify for itself that the upload uses the right ids.
The *only* concern is with #4. Everything else uses the well-known
global identifier for segments.
> I'd still argue that the majority -- probably the vast majority -- of
> people setting up DAS servers really just want to make an assertion
> like "I'm annotating build NCBI35 of the human genome" and be done
> with it.
I'm fine with that. There are two ways to do it: #1 and #2 above.
In theory only one of those is needed. The document can point to
"the" reference server for NCBI 35.
In practice that's not sufficient because there is no authoritative
NCBI 35 server.
Hence COORDINATES provides an abstract global identifier describing
the reference server.
> That's what the coordinate system stuff in DAS/2 is for. If this is
> documented properly I don't think we'll see many "end-user" sites
> setting up their own reference servers unless a) they want an internal
> mirror of a well-known server purely for performance/bandwidth reasons
> or b) they want to annotate an unpublished/new/whatever genome
> assembly.
A philosophical comment. I'm a distributed, self-organizing kinda
guy. I don't think single root centralized systems work well when
there are many different groups involved.
I think many people will use the registry server, but not all.
I think there will be public DAS servers which aren't in the registry.
I know there will be in-house DAS servers which aren't.
I'm just about certain that some sites will have local copies of
the primary data. They do for GenBank, for PDB, for SWISS-PROT,
for EnsEMBL. Why not for DAS?
That said, here's a couple of questions for you to answer:
a) When connecting to a new versioned source containing only
COORDINATES data, what should the client do to get the list
of segments, sizes, and primary sequence?
I can think of several answers. My answer is that the versioned
source should state the preferred reference server and unless
otherwise configured a client should use that reference server
and only that reference server.
Yes, all the reference servers for that coordinate system
are supposed to return the same results. But that's only if
they are available. There are performance issues too, like
low bandwidth or hosting the server on a slow machine. The
DAS client shouldn't round-robin through the list until it
finds one which works because that could take several minutes
to time out on a single server, with another 10 to try.
Yes, a client can be configured and told "for coordinate
system A use reference server Z". But that's a user
configuration.
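
The client policy in my answer to (a) could be sketched roughly like this in Python. It is only an illustration: the function name, its arguments and the example URLs are hypothetical and are not taken from the DAS/2 spec.

```python
def choose_reference_server(coordinates, preferred_server, user_overrides=None):
    """Return the single reference server a client should query.

    coordinates      -- coordinate system name (hypothetical string id)
    preferred_server -- URL stated by the versioned source, or None
    user_overrides   -- optional dict: coordinate system -> server URL,
                        i.e. "for coordinate system A use server Z"
    """
    overrides = user_overrides or {}
    # An explicit user configuration wins over what the source states.
    if coordinates in overrides:
        return overrides[coordinates]
    # Otherwise use the preferred server and only that server.
    # There is deliberately no round-robin over other known servers.
    if preferred_server is None:
        raise LookupError("no reference server known for %r" % (coordinates,))
    return preferred_server


# Hypothetical example values.
print(choose_reference_server("NCBI35", "http://ref.example.org/das/ncbi35"))
```

The point of returning exactly one URL is that a failure shows up right away instead of after a series of timeouts.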
b) If there is a local mirror of some reference server, how
should the local DAS clients be made aware of it? (And
should this be a supportable configuration? I think so.)
I'm pretty sure that most DAS clients won't be configurable
to look for local servers instead of global ones. Even if
they are, I'm pretty sure each will have a different way
to do so. Apollo and Bioperl will use different mechanisms.
I have no good answer for this. It sounds like your answer
is "people won't have local copies." I think they will.
Ideas:
- have a rewriting registry server which does a rewrite of
the information from the other servers. But this doesn't
work because the feature result from the remote server (in
my scheme) is given using its local segment names. There's
no way to go from that local name to the appropriate mirror
reference server. This suggests that the results really do
need to be given through global ids, with no support for
local ones. The segments result optionally provides a way
to resolve a global name through a local resource.
- set up an HTTP proxy service for DAS requests which
transparently detects, translates and redirects to the
appropriate local resource. Cute, but not likely to be
done in real life.
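
As a rough Python sketch of the first idea: features always carry global ids, and the segments result may optionally say where a global id can be fetched locally. The dictionary keys ("global_id", "local_url") and the URLs are made up for illustration. They are not DAS/2 element names.

```python
def build_resolver(segments):
    """Map each global segment id to the URL a client should fetch.

    segments -- list of dicts, each with a "global_id" and an
                optional "local_url" (hypothetical field names).
    """
    table = {}
    for seg in segments:
        # Fall back to the global id itself when no local copy is listed.
        table[seg["global_id"]] = seg.get("local_url") or seg["global_id"]
    return table


def resolve(global_id, table):
    """Return the local URL for a global id, or the id itself if unknown."""
    return table.get(global_id, global_id)


# Hypothetical segments result from a site with a mirror of chr1 only.
segments = [
    {"global_id": "http://ref.example.org/human/35/chr1",
     "local_url": "http://das.internal.example/mirror/chr1"},
    {"global_id": "http://ref.example.org/human/35/chr2"},
]
table = build_resolver(segments)
print(resolve("http://ref.example.org/human/35/chr1", table))
```

Because the feature results never mention local names, the lookup only ever goes one way, from global id to local resource.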
c) A group has been working on a new genome/assembly. The
data is annotated on local machines using DAS and DAS writeback.
Finally it's published. Do they need to rewrite all their
segment identifiers to use the newly defined global ones?
As there are only a few places where the segment identifier is
used, and it's an interface layer, I think the conversion is
easy. But it is a flag day event, which means people don't
want to do it. Instead, it's more likely that local people
will set up a synonym table to help with the conversion.
There are perhaps a dozen groups which might do this and they
all have competent people. This should not be a problem.
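
A synonym table of that kind might look something like the following Python sketch. The local ids, the global URIs and the shape of a feature record are all assumptions for illustration.

```python
# Hypothetical synonym table: pre-publication local id -> published global id.
SYNONYMS = {
    "local:scaffold_1": "http://ref.example.org/newgenome/1/scaffold_1",
    "local:scaffold_2": "http://ref.example.org/newgenome/1/scaffold_2",
}


def to_global(segment_id, synonyms):
    """Translate a local segment id; ids without a synonym pass through."""
    return synonyms.get(segment_id, segment_id)


def translate_features(features, synonyms):
    """Return copies of the feature records with segment ids made global.

    features -- list of dicts with a "segment" key (assumed record shape).
    The input records are left unmodified.
    """
    return [dict(f, segment=to_global(f["segment"], synonyms)) for f in features]


features = [{"id": "gene42", "segment": "local:scaffold_1", "start": 100, "end": 900}]
print(translate_features(features, SYNONYMS))
```

Since the translation sits at the interface layer, the stored annotations can keep their local ids until the group is ready to switch.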
Andrew
dalke at dalkescientific.com