[DAS2] current text of draft 3 of spec
Andrew Dalke
dalke at dalkescientific.com
Sun Mar 5 01:59:15 UTC 2006
I've been working on the 3rd draft for the spec. Because of
the confusion in the previous version I've decided on a
different approach where I jump into the middle and describe
how the parts fit together before getting into the details of
every element type or the theory behind the architecture.
I think this flows much better.
====================
DAS is a protocol for sharing biological data. This version of the
specification, DAS 2.0, describes features located on the genomic
sequence. Future versions will add support for sharing annotations of
protein sequences, expression data, 3D structures and ontologies. The
genomic DAS interface is deliberately designed so there will be a
large core shared with the protein sequence DAS.
A DAS 2.0 annotation server provides feature information about one or
more genome sources. Each source may have one or more versions.
Different versions are usually based on different assemblies. As an
implementation detail an assembly and corresponding sequence data may
be distributed via a different machine, which is called the reference
server.
Annotations are located on the genomic sequence with a start and end
position. The range may be specified multiple times if there are
alternate coordinate systems. An annotation may contain multiple
non-contiguous parts, making it the parent of those parts. Some
parts may have more than one parent. Annotations have a type based on
terms in SOFA (Sequence Ontology for Feature Annotation). Stylesheets
contain a set of properties used to depict a given type.
Annotations can be searched by range, type, and a properties table
associated with each annotation. These are called feature filters.
DAS 2.0 is implemented using a ReST architecture. Each document (also
called an entity or object) has a name, which is a URL. Fetching the
URL gets information about the document. The DAS-specific documents
are all in XML. Other data types have existing widely used formats,
and sometimes more than one for the same data. A DAS server may
provide a distinct document for each of these formats, along with
information about which formats are available.
DAS 2.0 addresses some shortcomings of the DAS 1.x protocol, including:
* Better support for hierarchical structures (e.g. transcript + exons)
* Ontology-based feature annotations
* Allow multiple formats, including formats only appropriate for
some feature types
* A lock-based editing protocol for curational clients
* An extensible namespacing system that allows annotations in
non-genomic coordinates (e.g. uniprot protein coordinates or PDB
structure coordinates)
=====
A DAS server supplies information about genomic sequence data sources.
The collection of all sources, each data source, and each version of a
data source are accessible through a URL. All three classes of URLs
return a document of content-type 'application/x-das-sources+xml'
though likely with differing amounts of detail. A 'versioned source'
request returns information only about a specific version of a data
source. A 'source' request returns the list of all the versioned
source data for that source. A 'sources' request returns the list of
all the source data, including all the versioned source data.
The URLs might not be distinct. For example, a server with only one
version of one data source may use the same URL for all three
documents, and a server for a single organism may use the same URL for
the 'sources' and 'source' documents.
Most servers will list only the data sources provided by that server.
Some servers combine the sources documents from other servers into a
single document. These registry servers act as a centralized index
and reduce configuration and network overhead. A registry server uses
the same sources format as an annotation server.
Here is an example of a simple sources document which makes no
distinction between the three sources categories.
Request:
http://www.example.com/das/genome/yeast.xml
Response:
Content-Type: application/x-das-sources+xml
<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00"
xml:base="http://www.example.com/das/genome/">
<SOURCE id="yeast.xml" title="Saccharomyces cerevisiae (Baker's
yeast) genome"
doc_href="http://www.example.com/yeast.html">
<VERSION id="yeast.xml" created="2005-12-05">
<COORDINATES taxid="4932" source="Gene_ID" authority="SGD32" />
<CAPABILITY type="features" query_id="features.xml" />
<CAPABILITY type="types" query_id="types.xml"/>
</VERSION>
</SOURCE>
</SOURCES>
All identifiers and href attributes in DAS documents follow the XML
Base specification (see http://www.w3.org/TR/xmlbase/ ) in resolving
partial identifiers and href attributes. In this case the id
"yeast.xml" is fully resolved to
"http://www.example.com/das/genome/yeast.xml".
Here is an example of a more complicated sources document with
multiple organisms each with multiple versions. Each of the two
source documents (one for each organism) has a distinct URL as does
each version of each organism. This is a pure registry server
because the actual annotation data comes from other machines.
Request:
http://www.biodas.org/known_servers
Response:
Content-Type: application/x-das-sources+xml
<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00">
<SOURCE id="http://das.ensembl.org/das/SPICEDS/"
title="das_vega_trans">
<VERSION id="http://das.ensembl.org/das/SPICEDS/127/"
created="2005-05-23">
<MAINTAINER email="someone at sanger.ac.uk" />
<COORDINATES taxid="7955" source="Chromosome" authority="ZV4"
test_range="BX255914" />
<CAPABILITY type="segments"
query_id="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62" />
<CAPABILITY type="features"
query_id="http://das.ensembl.org/das/SPICEDS/127/features">
<SUPPORTS name="das2queries" />
</CAPABILITY>
<CAPABILITY type="types"
query_id="http://das.ensembl.org/das/SPICEDS/127/types" />
</VERSION>
<VERSION id="http://das.ensembl.org/das/SPICEDS/128/"
created="2005-08-13">
<MAINTAINER email="someone-else at sanger.ac.uk" />
<COORDINATES taxid="7955" source="Chromosome" authority="ZV4"
test_range="BX255914" />
<CAPABILITY type="segments"
query_id="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62" />
<CAPABILITY type="features"
query_id="http://das.ensembl.org/das/SPICEDS/128/features">
<SUPPORTS name="das2queries" />
</CAPABILITY>
<CAPABILITY type="types"
query_id="http://das.ensembl.org/das/SPICEDS/128/types" />
<CAPABILITY type="locks"
url="http://das.ensembl.org/das/SPICEDS/128/locks" />
<CAPABILITY type="writeback"
url="http://das.ensembl.org/das/SPICEDS/128/locks" />
</VERSION>
</SOURCE>
<SOURCE id="http://www.example.com/das2/mus/sources.xml" title="Mus
musculus">
<VERSION id="http://www.example.com/das2/mus/42/sources.xml"
created="2006-02-11">
<MAINTAINER email="pied-piper at hamlet.ac.uk" />
<COORDINATES taxid="10090" source="Clone" authority="Ensembl"
test_range="AL935121" />
<CAPABILITY type="features"
query_id="http://www.example.com/cgi-bin/features-mus-v42.cgi">
<SUPPORTS name="das2queries" />
</CAPABILITY>
<CAPABILITY type="types"
query_id="http://www.example.com/das2/mus/v42/types.xml" />
</VERSION>
</SOURCE>
</SOURCES>
Each SOURCE id and VERSION id is individually fetchable so the URL
"http://das.ensembl.org/das/SPICEDS/" returns a sources document with
the SOURCE record for "das_vega_trans" and both of its VERSION
subelements while "http://das.ensembl.org/das/SPICEDS/128/" returns a
sources document with only the second of its VERSION subelements.
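As a rough illustration of how a client might walk a sources document,
here is a minimal Python sketch. It assumes the xml:base attribute
appears only on the root element and that every CAPABILITY of interest
carries a query_id attribute; the function name "capabilities" is
made up for this example and is not part of DAS. The embedded document
is a cut-down copy of the simple yeast example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

DAS_NS = "{http://www.biodas.org/ns/das/genome/2.00}"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Cut-down copy of the simple yeast sources document.
SOURCES_XML = """<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00"
         xml:base="http://www.example.com/das/genome/">
  <SOURCE id="yeast.xml" title="Saccharomyces cerevisiae genome">
    <VERSION id="yeast.xml" created="2005-12-05">
      <CAPABILITY type="features" query_id="features.xml" />
      <CAPABILITY type="types" query_id="types.xml" />
    </VERSION>
  </SOURCE>
</SOURCES>
"""

def capabilities(sources_xml):
    """Map each versioned source URL to {capability type: query URL}."""
    root = ET.fromstring(sources_xml)
    # Assumption: xml:base is set on the root element only.
    base = root.get(XML_BASE, "")
    result = {}
    for version in root.iter(DAS_NS + "VERSION"):
        # Partial ids are resolved against xml:base, as XML Base requires.
        version_url = urljoin(base, version.get("id"))
        result[version_url] = {
            cap.get("type"): urljoin(base, cap.get("query_id"))
            for cap in version.iter(DAS_NS + "CAPABILITY")
            if cap.get("query_id") is not None
        }
    return result

for url, caps in capabilities(SOURCES_XML).items():
    print(url, caps)
```

A fuller client would also honor xml:base attributes on nested elements;
this sketch skips that to stay short.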
DAS documents refer to other documents through URLs. There are no
restrictions on the internal form of the URLs, other than the query
string portion. Server implementers are free to choose URLs which
best fit their architectural needs. For example, a simple DAS server may
be implemented as a set of XML files hosted by a standard web server
while more complex servers with search support may be implemented as
CGI scripts or through embedded web server extensions. The URLs do
not need to define a hierarchical structure nor even be on the same
machine. Compare this to the DAS1 specification where some URLs were
constructed by direct string modification of other URLs.
=====
Each versioned source contains a set of segments. A segment is the
largest chunk of contiguous sequence. For fully sequenced organisms a
segment may be a chromosome. For partially assembled genomes where
the distance between the assembled regions is not known, each
region may be its own segment. If a server provides annotations in
contig space then each contig is a segment. Feature locations are
specified on ranges of segments which is why a specific set of
segments is called a coordinate system. [coordinate-system] This
specification does not describe how to do alignments between different
coordinate systems.
The sources document format has two ways to describe the coordinate
system. The optional COORDINATES element uniquely characterizes the
coordinate system. If two data sources have the same authority and
source values then they must be annotations on the same coordinate
system. The specific coordinate system is also called the "reference
sequence".
A versioned source may contain CAPABILITY elements which describe
different ways to request additional data from a DAS server. Each
CAPABILITY has a type that describes how to use the corresponding URL
to query a DAS server. A CAPABILITY element of type "segments" has a
query URL which returns a document of content-type
"application/x-das-segments+xml". A segments document lists
information about the segments in the coordinate system. Here is an
example of a segments document.
Request:
http://www.biodas.org/das2/h.sapiens/v3/segments.xml
Response:
Content-Type: application/x-das-segments+xml
<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://www.biodas.org/ns/das/genome/2.00">
<SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr1.xml"
name="Chr1" length="245522847"
doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=1"/>
<SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr2.xml"
name="Chr2" length="243018229"
doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=2"/>
</SEGMENTS>
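A client will typically want the segment names and lengths from this
document, for example to validate a requested range. The following is
a minimal Python sketch under that assumption; the helper name
"segment_lengths" is hypothetical, and the doc_href attributes are left
out of the embedded copy for brevity.

```python
import xml.etree.ElementTree as ET

DAS_NS = "{http://www.biodas.org/ns/das/genome/2.00}"

# The segments example, without the doc_href attributes.
SEGMENTS_XML = """<SEGMENTS xmlns="http://www.biodas.org/ns/das/genome/2.00">
  <SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr1.xml"
           name="Chr1" length="245522847" />
  <SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr2.xml"
           name="Chr2" length="243018229" />
</SEGMENTS>
"""

def segment_lengths(segments_xml):
    """Return {segment name: length in residues}."""
    root = ET.fromstring(segments_xml)
    return {
        seg.get("name"): int(seg.get("length"))
        for seg in root.iter(DAS_NS + "SEGMENT")
    }

print(segment_lengths(SEGMENTS_XML))
```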
=====
The versioned source record for an annotation server must include a
CAPABILITY of type "features". A client may use the query URL from
the features CAPABILITY to select features which match certain
criteria. If no criteria are specified the server must return all
features unless there are too many features to return. In that case
it must respond with an error message.
Unless an alternate format is specified, the response from the
features query is a document of content-type
"application/x-das-features+xml" containing all of the matching
features. Here is an example features document for a server which
contains a gene and an alignment.
Request:
http://das.biopackages.net/das/genome/yeast/S228C/features.pl
Response:
Content-Type: application/x-das-features+xml
<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
xml:base="http://www.example.org/volvox/1/">
<FEATURE id="feature/cTel54X" type_id="type/gene" name="tg-3">
<LOC segment="Chr2/1200:2917:1" />
</FEATURE>
<FEATURE id="feature/hit12"
type_id="type/est-alignment"
created="2001-12-15T22:43:36"
modified="2004-09-26T21:10:15" >
<LOC segment="Chr3/1201:1400:1" />
<PART id="feature/hit12.hsp1" />
<PART id="feature/hit12.hsp2" />
<ALIGN target_id="feature/yk12391" range="200:299" />
<PROP key="est2genomescore" value="180" />
</FEATURE>
<FEATURE id="feature/hit12.hsp1"
type_id="type/est-alignment-hsp">
<LOC segment="Chr3/1201:1250:-1" />
<PARENT id="feature/hit12"/>
<ALIGN target_id="feature/yk12391" range="1:52" gap="M49 D1 M1"/>
<PROP key="est2genomescore" value="180" />
</FEATURE>
<FEATURE id="feature/hit12.hsp2"
type_id="type/est-alignment-hsp" >
<LOC segment="Chr3/1351:1400:1" />
<PARENT id="feature/hit12" />
<ALIGN target_id="feature/yk12391" range="53:100" gap="M20 D1 G1
M30" />
<PROP key="est2genomescore" value="120" />
</FEATURE>
</FEATURES>
Each feature has a unique identifier and an identifier linking it to a
type record. Both identifiers are URLs and should be directly
fetchable. Simple features can be located on a region of a segment.
More complex features like a gapped alignment are represented through
a parent/part relationship. A feature may have multiple parents and
multiple parts.
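To make the parent/part relationship concrete, here is a Python sketch
that reads a features document and collects each feature's location
and its list of parts. It assumes the LOC segment attribute has the
layout "segment/start:end:strand" seen in the example; that layout is
inferred from the example rather than defined here. The names
"Location", "parse_loc" and "read_features" are invented for this sketch.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

DAS_NS = "{http://www.biodas.org/ns/das/genome/2.00}"
Location = namedtuple("Location", "segment start end strand")

# Cut-down copy of the alignment features from the example.
FEATURES_XML = """<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00">
  <FEATURE id="feature/hit12" type_id="type/est-alignment">
    <LOC segment="Chr3/1201:1400:1" />
    <PART id="feature/hit12.hsp1" />
    <PART id="feature/hit12.hsp2" />
  </FEATURE>
  <FEATURE id="feature/hit12.hsp1" type_id="type/est-alignment-hsp">
    <LOC segment="Chr3/1201:1250:-1" />
    <PARENT id="feature/hit12" />
  </FEATURE>
  <FEATURE id="feature/hit12.hsp2" type_id="type/est-alignment-hsp">
    <LOC segment="Chr3/1351:1400:1" />
    <PARENT id="feature/hit12" />
  </FEATURE>
</FEATURES>
"""

def parse_loc(text):
    """Split 'segment/start:end:strand' (assumed layout) into fields."""
    segment, _, span = text.rpartition("/")
    start, end, strand = span.split(":")
    return Location(segment, int(start), int(end), int(strand))

def read_features(features_xml):
    """Return ({feature id: Location}, {feature id: [part ids]})."""
    root = ET.fromstring(features_xml)
    locations, parts = {}, {}
    for feature in root.iter(DAS_NS + "FEATURE"):
        fid = feature.get("id")
        loc = feature.find(DAS_NS + "LOC")
        if loc is not None:
            locations[fid] = parse_loc(loc.get("segment"))
        parts[fid] = [p.get("id") for p in feature.findall(DAS_NS + "PART")]
    return locations, parts

locations, parts = read_features(FEATURES_XML)
for part_id in parts["feature/hit12"]:
    print(part_id, locations[part_id])
```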
=====
An annotation server may contain many features while the client may
only be interested in a subset; most likely features in a given
portion of the reference sequence. To help minimize the bandwidth
overhead the feature query URL should support the DAS feature filter
language. The syntax uses the standard HTML form-urlencoded GET query
syntax. For example, here is a request for all features on Chr2.
Request:
http://www.example.org/volvox/1/features.cgi?inside=Chr2
Response:
Content-Type: application/x-das-features+xml
<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
xml:base="http://www.example.org/volvox/1/">
<FEATURE id="feature/cTel54X" type_id="type/gene" name="tg-3">
<LOC segment="Chr2/1200:2917:1" />
</FEATURE>
</FEATURES>
And here is the rather long request for all EST alignments.
Request:
http://www.example.org/volvox/1/features.cgi?
type=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Ftype%2Fest-alignment
Response:
Content-Type: application/x-das-features+xml
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
xml:base="http://www.example.org/volvox/1/">
<FEATURE id="feature/hit12"
type_id="type/est-alignment"
created="2001-12-15T22:43:36"
modified="2004-09-26T21:10:15" >
<LOC segment="Chr3/1201:1400:1" />
<PART id="feature/hit12.hsp1" />
<PART id="feature/hit12.hsp2" />
<ALIGN target_id="feature/yk12391" range="200:299" />
<PROP key="est2genomescore" value="180" />
</FEATURE>
</FEATURES>
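Since the filter syntax is ordinary form-urlencoded GET syntax, a
client can build these request URLs with a standard library. The
Python sketch below shows one way to do that; "feature_query" is a
hypothetical helper, and only the two single-filter requests shown
above are exercised.

```python
from urllib.parse import urlencode

def feature_query(query_url, **filters):
    """Append form-urlencoded feature filters to a features query URL."""
    if not filters:
        return query_url
    return query_url + "?" + urlencode(filters)

base = "http://www.example.org/volvox/1/features.cgi"

# All features on Chr2.
print(feature_query(base, inside="Chr2"))

# All EST alignments; the type URL is percent-encoded automatically.
print(feature_query(
    base, type="http://www.example.org/volvox/1/type/est-alignment"))
```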
=====
All features are linked to a type record. DAS types do not describe a
formal type system in that DAS types do not derive from other DAS
types. Instead each DAS type links to an external ontology term and describes
how to depict features of that type.
A DAS annotation server must contain a CAPABILITY element of type
"types". A client may use its query URL to fetch a document of
content-type "application/x-das-types+xml". The document lists all of
the types available on the server. We expect that servers will have
at most a few dozen types so DAS does not support type filters.
The following is a hypothetical example of a DAS annotation server
providing GENSCAN gene predictions for zebrafish. Each feature is
either of type
"http://www.example.org/das/zebrafish/build19/high-type" or
"http://www.example.org/das/zebrafish/build19/low-type" depending on
if the data provider determined it was a high probability or low
probability prediction. Even though there are two different type
records they refer to the same ontology term, in this case the SO term
for "gene". The distinction exists so that the high probability
features are depicted differently from the low probability features.
Request:
http://www.example.org/das/zebrafish/build19/types
Response:
Content-Type: application/x-das-types+xml
<TYPES xmlns="http://www.biodas.org/ns/das/genome/2.00"
xml:base="http://www.example.org/das/zebrafish/build19/">
<TYPE id="high-type" title="High probability gene predictions"
doc_href="http://www.example.org/docs/genscan_prediction.html#high"
source="GENSCAN 1.0"
ontology="http://song.sourceforge.net/XXX/does/not/exist/SO/0000704"
accession="SO:0000704">
<STYLE>
<BOX fgcolor="red" border_width="1"/>
</STYLE>
</TYPE>
<TYPE id="low-type" title="Low probability gene predictions"
doc_href="http://www.example.org/docs/genscan_prediction.html#low"
source="GENSCAN 1.0"
ontology="http://song.sourceforge.net/XXX/does/not/exist/SO/0000704"
accession="SO:0000704">
<STYLE>
<BOX fgcolor="yellow" border_width="1"/>
</STYLE>
</TYPE>
</TYPES>
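The point that several type records can share one ontology term can be
checked mechanically. The Python sketch below groups type URLs by their
accession attribute; it assumes xml:base appears only on the root
element, and "types_by_accession" is a name invented for this example.
The embedded document is a shortened copy of the zebrafish types
document.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

DAS_NS = "{http://www.biodas.org/ns/das/genome/2.00}"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Shortened copy of the zebrafish types document.
TYPES_XML = """<TYPES xmlns="http://www.biodas.org/ns/das/genome/2.00"
       xml:base="http://www.example.org/das/zebrafish/build19/">
  <TYPE id="high-type" title="High probability gene predictions"
        source="GENSCAN 1.0" accession="SO:0000704">
    <STYLE><BOX fgcolor="red" border_width="1"/></STYLE>
  </TYPE>
  <TYPE id="low-type" title="Low probability gene predictions"
        source="GENSCAN 1.0" accession="SO:0000704">
    <STYLE><BOX fgcolor="yellow" border_width="1"/></STYLE>
  </TYPE>
</TYPES>
"""

def types_by_accession(types_xml):
    """Return {ontology accession: [fully resolved type URLs]}."""
    root = ET.fromstring(types_xml)
    # Assumption: xml:base is set on the root element only.
    base = root.get(XML_BASE, "")
    grouped = {}
    for type_el in root.iter(DAS_NS + "TYPE"):
        url = urljoin(base, type_el.get("id"))
        grouped.setdefault(type_el.get("accession"), []).append(url)
    return grouped

print(types_by_accession(TYPES_XML))
```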
[coordinate-system]
We make a distinction between "coordinate system" and "numbering
system". The coordinate system is the set of segments on which
features are located. The numbering system describes how to identify
the specific residues in the segment. DAS uses a 0-based numbering
system where the first residue is numbered "0", the second "1", and so
on. Other numbering systems include 1-based coordinates and the PDB
numbering system which preserves the residue number for key residues
across a homologous family by allowing discontinuities, insertions and
negative values as position numbers.
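For a single residue position, converting between the 0-based numbering
used by DAS and a 1-based numbering is a shift by one. The Python
sketch below covers only that case. It says nothing about how range
endpoints are interpreted, since that is not stated here, and the
function names are made up for illustration.

```python
def to_one_based(position):
    """Convert a 0-based residue position to 1-based numbering."""
    if position < 0:
        raise ValueError("0-based positions start at 0")
    return position + 1

def to_zero_based(position):
    """Convert a 1-based residue position to 0-based numbering."""
    if position < 1:
        raise ValueError("1-based positions start at 1")
    return position - 1

# The first residue is "0" in DAS and "1" in a 1-based system.
print(to_one_based(0), to_zero_based(1))
```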
Andrew
dalke at dalkescientific.com