[DAS2] URIs for sequence identifiers
Andrew Dalke
dalke at dalkescientific.com
Mon Mar 13 18:45:04 EST 2006
Proposals:
- do not use segment "name" as an identifier
- rename it "title" (human readable only)
- allow a new optional "alias-of" attribute which is the
link to the primary identifier for this segment
- change the feature location to use the segment uri
- change the feature filter range searches so there is a new "segment"
keyword and so that "inside", "overlaps", etc. only work on
the given segment, as
segment=<uri>
inside=$start:$stop
overlaps=$start:$stop
contains=$start:$stop
identical=$start:$stop
- If 'inside', 'overlaps', etc. are given then the 'segment'
must be given (do we need this restriction? It doesn't make
sense to me to ask for "annotations on 1000 to 2000 of anything")
- only allow at most one each of inside, overlaps,
contains, or identical (do we need this restriction?)
- multiple segments may be given, but then range searches
are not supported (do we need this restriction?)
Discussion:
The discussion on this side of things was based on today's phone
conference. Andreas needs data sources to work on multiple
coordinate spaces.
To quote from Andreas:
> There are several servers that understand more than one coordinate
> system and can return the same type of data in different coordinates.
> (depending on which type of accession code/range was used for the
> request ) E.g. there are a couple of zebrafish servers that speak
> both in Chromosome and Scaffold coordinates. (reason perhaps
> being that zebrafish is an organism that seems to be very difficult
> to assemble ?)
The current DAS system does not support this because of how
it does segment identifiers.
The current scheme looks like this:
<!-- sources.xml -->
<SOURCES ...>
<SOURCE ...>
<VERSION ...>
<COORDINATES authority="Andreas" source="Scaffold" ... />
<COORDINATES authority="Andreas" source="Chromosome" ... />
<CAPABILITY type="segments" query_id="http://sanger/andreas/" />
....
Problem #1: We need two entry points, one to view the segments
in Scaffold space, the other to view them in Chromosome space.
Solution #1 (I don't like it, though):
Add a "source=" attribute to the CAPABILITY and allow multiple
segments capabilities
<!-- sources.xml -->
<SOURCES ...>
<SOURCE ...>
<VERSION ...>
<COORDINATES authority="Andreas" source="Scaffold" ... />
<COORDINATES authority="Andreas" source="Chromosome" ... />
<CAPABILITY type="segments"
query_id="http://sanger/andreas/scaffolds.xml" source="Scaffold"
/>
<CAPABILITY type="segments"
query_id="http://sanger/andreas/chromosomes.xml"
source="Chromosome" />
....
I don't like it because it feels like the COORDINATES and
CAPABILITY[type="segments"] fields should be merged. Still, I'll
go with it for now.
Problem #2: feature searches return features from either namespace
Consider a search for name=*ABC* (that is, "ABC" as a substring in
the "name" or "alias" fields). Then the result might be
<FEATURES>
<FEATURE id="F0001" type_id="T0001">
<LOC segment="A/100:200" />
</FEATURE>
</FEATURES>
Here "A" is shorthand for one of the segments, but which one?
The client goes to the segment servers:
Query: http://sanger/andreas/scaffolds.xml
Response:
<SEGMENTS>
<SEGMENT id="http://whatever.com/ScaffoldA" name="A" length="2000" />
</SEGMENTS>
Query: http://sanger/andreas/chromosomes.xml
Response:
<SEGMENTS>
<SEGMENT id="http://whatever.com/ChromosomeA" name="A" length="2000" />
</SEGMENTS>
The segment name "A" matches either ChromosomeA or ScaffoldA, and
there's no way to figure out which is correct!
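
To make the ambiguity concrete, here is a minimal Python sketch. The dictionaries are hypothetical in-memory copies of the two segments responses, and candidates_for_name is an illustrative helper, not part of any DAS client.

```python
# Hypothetical in-memory copies of the two segments responses.
SPACES = {
    "Scaffold": [
        {"id": "http://whatever.com/ScaffoldA", "name": "A", "length": 2000},
    ],
    "Chromosome": [
        {"id": "http://whatever.com/ChromosomeA", "name": "A", "length": 2000},
    ],
}


def candidates_for_name(name, spaces):
    """Return every (coordinate space, segment id) whose name matches."""
    return [
        (space, segment["id"])
        for space, segments in spaces.items()
        for segment in segments
        if segment["name"] == name
    ]


# The short name from <LOC segment="A/100:200" /> matches in both spaces.
matches = candidates_for_name("A", SPACES)
print(matches)
```

Two candidates come back for the one name, so the client has nothing to choose between them with.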
This comes about because our own naming scheme is not very good at
being globally unique. We could fix it by also stating the
namespace in the result, as
<FEATURES>
<FEATURE id="F0001" type_id="T0001">
<LOC segment="A/100:200" source="Scaffold"/>
</FEATURE>
</FEATURES>
Gregg asked "why don't we just use the URI"?
After a long discussion we decided to propose just that.
That is, get rid of the "name" attribute. Instead, use a
"title" attribute which is human readable and an optional
"alias-of" which contains the primary identifier for
the given segment.
The alias-of value is determined by the person who
defined the COORDINATES. It could be a URL. It could
be a URI. It does not need to be resolvable (though it
should be - perhaps to a human readable document? Or to
something which lists all known aliases to it?)
That is, the segments document will look like this:
Query: http://sanger/andreas/scaffolds.xml
Response:
<SEGMENTS>
<SEGMENT uri="http://whatever.com/ScaffoldA" length="2000"
alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Scaffold/A"
title="Scaffold A" />
</SEGMENTS>
Query: http://sanger/andreas/chromosomes.xml
Response:
<SEGMENTS>
<SEGMENT uri="http://whatever.com/ChromosomeA" length="2000"
alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Chromosome/A"
title="Chromosome A" />
</SEGMENTS>
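
As a rough sketch of how a client might read a segments document in the proposed format, here is some Python using the standard library's ElementTree. The attribute names (uri, length, alias-of, title) come from the proposal; the parse_segments helper and the dictionary layout are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A segments document in the proposed format (Scaffold space).
SCAFFOLDS_XML = """\
<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ScaffoldA" length="2000"
           alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Scaffold/A"
           title="Scaffold A" />
</SEGMENTS>
"""


def parse_segments(xml_text):
    """Return one dict per SEGMENT element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    segments = []
    for elem in root.findall("SEGMENT"):
        segments.append({
            "uri": elem.get("uri"),            # the identifier
            "length": int(elem.get("length")),
            "alias_of": elem.get("alias-of"),  # optional, None if absent
            "title": elem.get("title"),        # human readable only
        })
    return segments


segments = parse_segments(SCAFFOLDS_XML)
print(segments[0]["uri"], segments[0]["title"])
```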
This has a few implications. Feature locations must be given
with respect to the segment uri, as
<FEATURES>
<FEATURE id="F0001" type_id="T0001">
<LOC segment_uri="http://whatever.com/ScaffoldA" range="200:300"/>
</FEATURE>
</FEATURES>
Given this segment_uri a client can figure out if it is in
Scaffold or Chromosome space because it can check all of the
URIs in each space for a match.
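
A minimal Python sketch of that lookup, assuming the client has already collected the segment URIs for each coordinate space. The SPACES table and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical table of segment URIs the client has fetched per space.
SPACES = {
    "Scaffold": ["http://whatever.com/ScaffoldA"],
    "Chromosome": ["http://whatever.com/ChromosomeA"],
}

FEATURES_XML = """\
<FEATURES>
  <FEATURE id="F0001" type_id="T0001">
    <LOC segment_uri="http://whatever.com/ScaffoldA" range="200:300"/>
  </FEATURE>
</FEATURES>
"""


def space_of(segment_uri, spaces=SPACES):
    """Return the coordinate space that lists this URI, or None."""
    for space, uris in spaces.items():
        if segment_uri in uris:
            return space
    return None


def feature_locations(xml_text):
    """Yield (feature id, coordinate space, start, end) for each LOC."""
    root = ET.fromstring(xml_text)
    for feature in root.findall("FEATURE"):
        for loc in feature.findall("LOC"):
            start, end = (int(x) for x in loc.get("range").split(":"))
            yield feature.get("id"), space_of(loc.get("segment_uri")), start, end


locations = list(feature_locations(FEATURES_XML))
print(locations)
```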
The other change is in range searches. Consider the current
scheme, which looks like
includes=ChrA
includes=A/100:300
The query is of the form $ID or $ID/$start:$end. It needs to be
changed to support URLs. For example,
includes={http://www.whatever.com/ChromosomeA}
includes={http://www.whatever.com/ScaffoldA}/100:300
We couldn't come up with a better syntax. Then Gregg asked
"why do we need multiple includes"?
That is, the current syntax supports
includes=ChrA/0:1000;includes=ChrB/2000:3000;includes=ChrC/5000:6000
to mean "anywhere on the first 1000 bases of ChrA, the 3rd 1000
bases of ChrB, or the 6th 1000 bases of ChrC".
Given the query language, we're looking for a way to write that
using URLs, as
includes={http://www.whatever.com/ChromosomeA}/0:1000;includes={http://www.whatever.com/ChromosomeB}/2000:3000;includes={http://www.whatever.com/ChromosomeC}/5000:6000
However, that's a very unlikely query. What if we split the
"includes", "overlaps", etc. into "includes_segment" and
"includes_range"?
In that case:
old-style:
includes=A/500:600
new-style:
includes_segment=http://www.whatever.com/ChromosomeA;
includes_range=500:600
old-style:
includes=A/500:600,A/700:800
new-style:
includes_segment=http://www.whatever.com/ChromosomeA;
includes_range=500:600;
includes_range=700:800
old-style:
includes=A/500:600,D/700:800
new-style: -- NOT POSSIBLE
old-style:
includes=A/500:600,D/500:600
new-style: (not likely to be used in real life)
includes_segment=http://www.whatever.com/ChromosomeA;
includes_segment=http://www.whatever.com/ChromosomeD;
includes_range=500:600;
This no longer allows searches with subranges from different segments.
Then again -- who cares? Those sorts of searches are strange.
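
Here is a small Python sketch of the old-style to new-style translation, to show which queries survive the split. It assumes that several includes_segment and includes_range values combine as "every range on every segment", which is my reading of the examples rather than anything we agreed on. It also assumes each old-style item carries a range. The to_new_style function and the NAME_TO_URI table are hypothetical.

```python
# Hypothetical mapping from old short names to segment URIs.
NAME_TO_URI = {
    "A": "http://www.whatever.com/ChromosomeA",
    "D": "http://www.whatever.com/ChromosomeD",
}


def to_new_style(old_value, name_to_uri):
    """Translate 'A/500:600,D/500:600' into the split form, or return
    None when the split form cannot express the query."""
    pairs = []
    for item in old_value.split(","):
        name, _, rng = item.partition("/")
        pairs.append((name, rng))

    # Distinct segments and ranges, in first-seen order.
    segments = list(dict.fromkeys(name for name, _ in pairs))
    ranges = list(dict.fromkeys(rng for _, rng in pairs))

    # Assumption: the split form means every range on every segment,
    # so only a full cross product can be represented.
    if set(pairs) != {(s, r) for s in segments for r in ranges}:
        return None

    params = [f"includes_segment={name_to_uri[s]}" for s in segments]
    params += [f"includes_range={r}" for r in ranges]
    return ";".join(params)


print(to_new_style("A/500:600", NAME_TO_URI))
print(to_new_style("A/500:600,D/700:800", NAME_TO_URI))  # not expressible
```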
We talked some more. Who needs the ability to do more than one
includes / overlaps / etc. query at a time? Gregg wants the
ability to do a combination of includes and overlaps, but
that's all.
We can simplify the server code by only supporting one
inside search, one contains search, and/or one overlaps
search, instead of the current system which allows a more
constructive geometry, and we can move the segment id out
into its own parameter.
Allen said that that would prevent more complicated types
of analysis on the server, but that anyone doing more
complicated searches would pull the data down locally.
Does anyone want to do more than one overlaps search at
a time? More than one contains search at a time? More
than one identical search at a time?
(For that matter, does anyone actually want to do an "identical"
search? Gregg thinks it will be useful to find any other
annotations which are exactly matching the given range.
I think that might be better with a "include"/"exclude" combination
to have start/end positions within a couple of bases from
the specified range.)
PROPOSAL:
Change the range query language to have
segment= <the URL of the segment to search>
inside= $start:$end
overlaps= $start:$end
contains= $start:$end
Example:
segment=http://whatever.com/ChromosomeD;inside=5000:6000
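
For illustration, a client could assemble such a query string along these lines. This is only a sketch: build_range_query is a hypothetical helper, and it copies the segment URL in unescaped, as in the example, whereas a real client would probably need to escape it.

```python
def build_range_query(segment, inside=None, overlaps=None, contains=None):
    """Build 'segment=<url>;inside=<start>:<end>;...' (hypothetical helper).

    Each range is a (start, end) tuple. The segment URL is not
    percent-encoded here, to match the example in the proposal.
    """
    parts = [f"segment={segment}"]
    for key, rng in (("inside", inside),
                     ("overlaps", overlaps),
                     ("contains", contains)):
        if rng is not None:
            start, end = rng
            parts.append(f"{key}={start}:{end}")
    return ";".join(parts)


query = build_range_query("http://whatever.com/ChromosomeD", inside=(5000, 6000))
print(query)
```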
Also, only allow at most one inside, one overlaps, and
one contains (unless people want more). I'm less sure about
the need for this restriction. It might be as easy to
implement the more complex search as it would be to check
for the error cases.
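
To give a feel for how much work the error checking would be on the server side, here is a rough Python sketch of the restrictions as proposed: at most one of each range keyword, and a range search needs exactly one segment. The function names and error strings are made up, splitting on ";" assumes the segment URL contains no ";", and "identical" could be added to RANGE_KEYS if it stays in.

```python
RANGE_KEYS = ("inside", "overlaps", "contains")


def parse_query(query):
    """Split 'k=v;k=v' into (key, value) pairs, splitting on the first '='."""
    pairs = []
    for part in query.split(";"):
        if part:
            key, _, value = part.partition("=")
            pairs.append((key, value))
    return pairs


def validate(pairs):
    """Return a list of error strings; an empty list means acceptable."""
    errors = []
    segments = [value for key, value in pairs if key == "segment"]
    range_count = 0
    for range_key in RANGE_KEYS:
        count = sum(1 for key, _ in pairs if key == range_key)
        range_count += count
        if count > 1:
            errors.append(f"at most one '{range_key}' is allowed")
    if range_count and len(segments) != 1:
        errors.append("a range search needs exactly one 'segment'")
    return errors


ok = validate(parse_query("segment=http://whatever.com/ChromosomeD;inside=5000:6000"))
print(ok)
```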
Andrew
dalke at dalkescientific.com