DAS is a protocol for sharing biological data.  This version of the
specification, DAS 2.0, describes features located on the genomic
sequence.  Future versions will add support for sharing annotations of
protein sequences, expression data, 3D structures and ontologies.  The
genomic DAS interface is deliberately designed so there will be a
large core shared with the protein sequence DAS.

A DAS 2.0 annotation server provides feature information about one or
more genome sources.  Each source may have one or more versions.
Different versions are usually based on different assemblies.  As an
implementation detail an assembly and corresponding sequence data may
be distributed via a different machine, which is called the reference
server.

Annotations are located on the genomic sequence with a start and end
position.  The range may be specified multiple times if there are
alternate coordinate systems.  An annotation may contain multiple
non-contiguous parts, making it the parent of those parts.  Some
parts may have more than one parent.  Annotations have a type based on
terms in SOFA (Sequence Ontology for Feature Annotation).  Stylesheets
contain a set of properties used to depict a given type.

Annotations can be searched by range, type, and a properties table
associated with each annotation.  These are called feature filters.

DAS 2.0 is implemented using a ReST architecture.  Each document (also
called an entity or object) has a name, which is a URL.  Fetching the
URL gets information about the document.  The DAS-specific documents
are all in XML.  Other data types have existing widely used formats,
and sometimes more than one for the same data.  A DAS server may
provide a distinct document for each of these formats, along with
information about which formats are available.

DAS 2.0 addresses some shortcomings of the DAS 1.x protocol, including:

 * Better support for hierarchical structures (e.g. transcript + exons)

 * Ontology-based feature annotations

 * Allow multiple formats, including formats only appropriate for
   some feature types

 * A lock-based editing protocol for curational clients

 * An extensible namespacing system that allows annotations in
   non-genomic coordinates (e.g. UniProt protein coordinates or PDB
   structure coordinates)


===== The SOURCES document (overview)

A DAS server supplies information about genomic sequence data sources.
The collection of all sources, each data source, and each version of a
data source are accessible through a URL.  All three classes of URLs
return a document of content-type 'application/x-das-sources+xml'
though likely with differing amounts of detail.  A 'versioned source'
request returns information only about a specific version of a data
source.  A 'source' request returns the list of all the versioned
source data for that source.  A 'sources' request returns the list of
all the source data, including all the versioned source data.

The URLs might not be distinct.  For example, a server with only one
version of one data source may use the same URL for all three
documents, and a server for a single organism may use the same URL for
the 'sources' and 'source' documents.

Most servers will list only the data sources provided by that server.
Some servers combine the sources documents from other servers into a
single document.  These registry servers act as a centralized index
and reduce configuration and network overhead.  A registry server uses
the same sources format as an annotation server.

Here is an example of a simple sources document which makes no
distinction between the three sources categories.  Like the other
new DAS formats, it is in XML.  All of the DAS elements are in the XML
namespace

    http://www.biodas.org/ns/das/genome/2.00

(XXX Is that still correct?).  This namespace is reserved and authors
of DAS extensions may not create new elements in it.


Request:

http://www.example.com/das/genome/yeast.xml

Response:

Content-Type: application/x-das-sources+xml

<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00"
          xml:base="http://www.example.com/das/genome/">

  <SOURCE id="yeast.xml" title="Saccharomyces cerevisiae (Baker's yeast) genome"
         doc_href="http://www.example.com/yeast.html">
    <VERSION id="yeast.xml" created="2005-12-05">
      <COORDINATES taxid="4932" source="Gene_ID" authority="SGD32" />
      <CAPABILITY type="features" query_id="features.xml" />
      <CAPABILITY type="types" query_id="types.xml"/>
    </VERSION>
  </SOURCE>

</SOURCES>

All identifiers and href attributes in DAS documents follow the XML
Base specification (see http://www.w3.org/TR/xmlbase/ ) in resolving
partial identifiers and href attributes.  In this case the relative id
"yeast.xml" is fully resolved using the xml:base of
"http://www.example.com/das/genome/" to
"http://www.example.com/das/genome/yeast.xml". If the result after
resolving through all the parent xml:base attributes is still a
relative URL then it is resolved once more with respect to the URL
used to fetch the document.
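
The resolution rule above can be sketched in a few lines of Python.
This is only an illustration, assuming the xml:base values of the
enclosing elements have already been collected (outermost first); the
function name resolve_id is hypothetical and not part of DAS.

```python
from urllib.parse import urljoin


def resolve_id(relative_id, xml_bases, document_url):
    """Resolve an id or href against nested xml:base values.

    xml_bases lists the xml:base attributes from the outermost element
    inward.  The URL used to fetch the document is the starting base, so
    a result that is still relative after all xml:base attributes have
    been applied is resolved against it.
    """
    base = document_url
    for xml_base in xml_bases:
        base = urljoin(base, xml_base)
    return urljoin(base, relative_id)


resolved = resolve_id(
    "yeast.xml",
    ["http://www.example.com/das/genome/"],
    "http://www.example.com/das/genome/yeast.xml",
)
print(resolved)
```

With the values from the example document this resolves "yeast.xml" to
"http://www.example.com/das/genome/yeast.xml".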


Here is an example of a more complicated sources document with
multiple organisms each with multiple versions.  Each of the two
source documents (one for each organism) has a distinct URL as does
each of the versions for each organism.  This is a pure registry server
because the actual annotation data comes from other machines.

Request:
  http://www.biodas.org/known_das_servers

Response:

Content-Type: application/x-das-sources+xml

<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00">
  <SOURCE id="http://das.ensembl.org/das/SPICEDS/" title="das_vega_trans">
    <VERSION id="http://das.ensembl.org/das/SPICEDS/127/" created="2005-05-23">
      <MAINTAINER email="someone@sanger.ac.uk" />
      <COORDINATES taxid="7955" source="Chromosome" authority="ZV4"
                   test_range="BX255914" />
      <CAPABILITY type="segments"
             query_id="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62" />
      <CAPABILITY type="features"
           query_id="http://das.ensembl.org/das/SPICEDS/127/features">
        <SUPPORTS name="das2queries" />
      </CAPABILITY>
      <CAPABILITY type="types"
           query_id="http://das.ensembl.org/das/SPICEDS/127/types" />
    </VERSION>

    <VERSION id="http://das.ensembl.org/das/SPICEDS/128/" created="2005-08-13">
      <MAINTAINER email="someone-else@sanger.ac.uk" />
      <COORDINATES taxid="7955" source="Chromosome" authority="ZV4"
                   test_range="BX255914" />
      <CAPABILITY type="segments"
             query_id="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62" />
      <CAPABILITY type="features"
           query_id="http://das.ensembl.org/das/SPICEDS/128/features">
        <SUPPORTS name="das2queries" />
      </CAPABILITY>
      <CAPABILITY type="types"
           query_id="http://das.ensembl.org/das/SPICEDS/128/types" />
      <CAPABILITY type="locks"
                query_id="http://das.ensembl.org/das/SPICEDS/128/locks" />
      <CAPABILITY type="writeback"
                query_id="http://das.ensembl.org/das/SPICEDS/128/locks" />
    </VERSION>
  </SOURCE>

  <SOURCE id="http://www.example.com/das2/mus/sources.xml" title="Mus musculus">
    <VERSION id="http://www.example.com/das2/mus/42/sources.xml" created="2006-02-11">
      <MAINTAINER email="pied-piper@hamlet.ac.uk" />
      <COORDINATES taxid="10090" source="Clone" authority="Ensembl"
                test_range="AL935121" />
      <CAPABILITY type="features"
           query_id="http://www.example.com/cgi-bin/features-mus-v42.cgi">
        <SUPPORTS name="das2queries" />
      </CAPABILITY>
      <CAPABILITY type="types"
           query_id="http://www.example.com/das2/mus/v42/types.xml" />
    </VERSION>
  </SOURCE>
</SOURCES>

Each SOURCE id and VERSION id is individually fetchable so the URL
"http://das.ensembl.org/das/SPICEDS/" returns a sources document with
the SOURCE record for "das_vega_trans" and both of its VERSION
subelements while "http://das.ensembl.org/das/SPICEDS/128/" returns a
sources document with only the second of its VERSION subelements.
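
As a rough sketch of how a client might read such a document, the
following Python walks the SOURCE and VERSION elements and collects
the query URL of each CAPABILITY.  The embedded document is a trimmed
copy of the registry example, the function name list_versions is
hypothetical, and xml:base resolution is left out because all ids
here are absolute.

```python
import xml.etree.ElementTree as ET

DAS_NS = "{http://www.biodas.org/ns/das/genome/2.00}"

SOURCES_XML = """\
<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00">
  <SOURCE id="http://das.ensembl.org/das/SPICEDS/" title="das_vega_trans">
    <VERSION id="http://das.ensembl.org/das/SPICEDS/127/" created="2005-05-23">
      <CAPABILITY type="features"
           query_id="http://das.ensembl.org/das/SPICEDS/127/features" />
    </VERSION>
    <VERSION id="http://das.ensembl.org/das/SPICEDS/128/" created="2005-08-13">
      <CAPABILITY type="features"
           query_id="http://das.ensembl.org/das/SPICEDS/128/features" />
    </VERSION>
  </SOURCE>
</SOURCES>
"""


def list_versions(sources_xml):
    """Return {version id: {capability type: query URL}}."""
    root = ET.fromstring(sources_xml)
    versions = {}
    for source in root.findall(DAS_NS + "SOURCE"):
        # Only direct VERSION children of a SOURCE are versioned sources.
        for version in source.findall(DAS_NS + "VERSION"):
            capabilities = {}
            for cap in version.findall(DAS_NS + "CAPABILITY"):
                capabilities[cap.get("type")] = cap.get("query_id")
            versions[version.get("id")] = capabilities
    return versions


versions = list_versions(SOURCES_XML)
for version_id, capabilities in versions.items():
    print(version_id, capabilities)
```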

DAS documents refer to other documents through URLs.  There are no
restrictions on the internal form of the URLs, other than the query
string portion.  Server implementers are free to choose URLs which
best fit the architecture needs.  For example, a simple DAS server may
be implemented as a set of XML files hosted by a standard web server
while more complex servers with search support may be implemented as
CGI scripts or through embedded web server extensions.  The URLs do
not need to define a hierarchical structure nor even be on the same
machine. Compare this to the DAS1 specification where some URLs were
constructed by direct string modification of other URLs.

===== The SEGMENTS document (overview)

Each versioned source contains a set of segments. A segment is the
largest chunk of contiguous sequence. For fully sequenced organisms a
segment may be a chromosome.  For partially assembled genomes where
the distance between the assembled regions is not known then each
region may be its own segment.  If a server provides annotations in
contig space then each contig is a segment.  Feature locations are
specified on ranges of segments which is why a specific set of
segments is called a coordinate system.  [coordinate-system] This
specification does not describe how to do alignments between different
coordinate systems.


The sources document format has two ways to describe the coordinate
system.  The optional COORDINATES element uniquely characterizes the
coordinate system.  If two data sources have the same authority and
source values then they must be annotations on the same coordinate
system.  The specific coordinate system is also called the "reference
sequence".

A versioned source may contain CAPABILITY elements which describe
different ways to request additional data from a DAS server.  Each
CAPABILITY has a type that describes how to use the corresponding URL
to query a DAS server.  A CAPABILITY element of type "segments" has a
query URL which returns a document of content-type
"application/x-das-segments+xml".  A segments document lists
information about the segments in the coordinate system.  Here is an
example of a segments document.

Request:

http://www.biodas.org/das2/h.sapiens/v37/segments.xml

Response:

Content-Type: application/x-das-segments+xml

<?xml version="1.0" encoding="UTF-8"?>
<das:SEGMENTS xmlns:das="http://www.biodas.org/ns/das/genome/2.00">
 <das:SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr1.xml"
     name="Chr1" length="245522847"
     doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=1"/>
 <das:SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr2.xml"
     name="Chr2" length="243018229"
     doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=2"/>
</das:SEGMENTS>


Note that unlike the previous examples this document defined the new
namespace abbreviation "das" instead of defining a default namespace.

===== The FEATURES document (overview)

The versioned source record for an annotation server must include a
CAPABILITY of type "features".  A client may use the query URL from
the features CAPABILITY to select features which match certain
criteria.  If no criteria are specified the server must return all
features unless there are too many features to return.  In that case
it must respond with an error message.

Unless an alternate format is specified, the response from the
features query is a document of content-type
"application/x-das-features+xml" containing all of the matching
features.  Here is an example features document for a server which
contains a gene and an alignment.

Request:

http://das.biopackages.net/das/genome/yeast/S228C/features.pl

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
          xml:base="http://www.example.org/volvox/1/">
 <FEATURE id="feature/cTel54X" type_id="type/gene" name="tg-3">
   <LOC segment="Chr2/1200:2917:1" />
 </FEATURE>

 <FEATURE id="feature/hit12"
          type_id="type/est-alignment"
          created="2001-12-15T22:43:36"
          modified="2004-09-26T21:10:15" >

   <LOC segment="Chr3/1201:1400:1" />
   <PART id="feature/hit12.hsp1" />
   <PART id="feature/hit12.hsp2" />
   <ALIGN target_id="feature/yk12391" range="200:299" />
   <PROP key="est2genomescore" value="180" />
 </FEATURE>

 <FEATURE id="feature/hit12.hsp1"
          type_id="type/est-alignment-hsp">
   <LOC segment="Chr3/1201:1250:-1" />
   <PARENT id="feature/hit12"/>
   <ALIGN target_id="feature/yk12391" range="1:52" gap="M49 D1 M1"/>
   <PROP  key="est2genomescore" value="180" />
 </FEATURE>

 <FEATURE id="feature/hit12.hsp2"
          type_id="type/est-alignment-hsp" >
   <LOC segment="Chr3/1351:1400:1" />
   <PARENT id="feature/hit12" />
   <ALIGN target_id="feature/yk12391" range="53:100" gap="M20 D1 G1 M30" />
   <PROP  key="est2genomescore" value="120" />
 </FEATURE>

</FEATURES>

Each feature has a unique identifier and an identifier linking it to a
type record.  Both identifiers are URLs and should be directly
fetchable.  Simple features can be located on a region of a segment.
More complex features like a gapped alignment are represented through
a parent/part relationship.  A feature may have multiple parents and
multiple parts.
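
The parent/part relationship can be read off a features document with
a small amount of code.  The sketch below is illustrative only: the
embedded document is a trimmed copy of the alignment from the example
above, and the function name feature_graph is hypothetical.

```python
import xml.etree.ElementTree as ET

DAS_NS = "{http://www.biodas.org/ns/das/genome/2.00}"

FEATURES_XML = """\
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00">
 <FEATURE id="feature/hit12" type_id="type/est-alignment">
   <LOC segment="Chr3/1201:1400:1" />
   <PART id="feature/hit12.hsp1" />
   <PART id="feature/hit12.hsp2" />
 </FEATURE>
 <FEATURE id="feature/hit12.hsp1" type_id="type/est-alignment-hsp">
   <LOC segment="Chr3/1201:1250:-1" />
   <PARENT id="feature/hit12" />
 </FEATURE>
 <FEATURE id="feature/hit12.hsp2" type_id="type/est-alignment-hsp">
   <LOC segment="Chr3/1351:1400:1" />
   <PARENT id="feature/hit12" />
 </FEATURE>
</FEATURES>
"""


def feature_graph(features_xml):
    """Return (parts, roots).

    parts maps each feature id to the ids named by its PART elements.
    roots lists the ids of features that have no PARENT element.
    """
    root = ET.fromstring(features_xml)
    parts = {}
    roots = []
    for feature in root.findall(DAS_NS + "FEATURE"):
        feature_id = feature.get("id")
        parts[feature_id] = [p.get("id") for p in feature.findall(DAS_NS + "PART")]
        if feature.find(DAS_NS + "PARENT") is None:
            roots.append(feature_id)
    return parts, roots


parts, roots = feature_graph(FEATURES_XML)
print(roots)
print(parts["feature/hit12"])
```

A dictionary of lists is used rather than a tree because a feature
may have more than one parent.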

===== Feature Filters (overview)

An annotation server may contain many features while the client may
only be interested in a subset; most likely features in a given
portion of the reference sequence.  To help minimize the bandwidth
overhead the feature query URL should support the DAS feature filter
language.  The syntax uses the standard HTML form-urlencoded query
syntax.  For example, here is a request for all features on Chr2.

Request:

http://www.example.org/volvox/1/features.cgi?inside=Chr2

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
          xml:base="http://www.example.org/volvox/1/">
 <FEATURE id="feature/cTel54X" type_id="type/gene" name="tg-3">
   <LOC segment="Chr2/1200:2917:1" />
 </FEATURE>

</FEATURES>

Here is the rather long request for all EST alignments:

Request:

http://www.example.org/volvox/1/features.cgi?type=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Ftype%2Fest-alignment

Response:

Content-Type: application/x-das-features+xml

<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
          xml:base="http://www.example.org/volvox/1/">
 <FEATURE id="feature/hit12"
          type_id="type/est-alignment"
          created="2001-12-15T22:43:36"
          modified="2004-09-26T21:10:15" >

   <LOC segment="Chr3/1201:1400:1" />
   <PART id="feature/hit12.hsp1" />
   <PART id="feature/hit12.hsp2" />
   <ALIGN target_id="feature/yk12391" range="200:299" />
   <PROP key="est2genomescore" value="180" />
 </FEATURE>
</FEATURES>
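
Filter URLs like the ones above can be built mechanically.  The
following Python is a minimal sketch, not a reference implementation.
The helper name build_filter_url is hypothetical, and joining fields
with ';' is an assumption taken from the example URLs elsewhere in
this specification.

```python
from urllib.parse import quote


def build_filter_url(query_url, filters):
    """Append (key, value) filter pairs to a feature query URL.

    Keys and values are percent-encoded, including '/' and ':'.
    Fields are joined with ';'.
    """
    if not filters:
        return query_url
    fields = []
    for key, value in filters:
        fields.append(quote(key, safe="") + "=" + quote(value, safe=""))
    return query_url + "?" + ";".join(fields)


url = build_filter_url(
    "http://www.example.org/volvox/1/features.cgi",
    [("type", "http://www.example.org/volvox/1/type/est-alignment")],
)
print(url)
```

For the type filter this reproduces the escaped request URL shown in
the EST alignment example.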

===== The TYPES document (overview)

All features are linked to a type record.  DAS types do not describe a
formal type system in that DAS types do not derive from other DAS
types.  Instead each type links to an external ontology term and
describes how to depict features of that type.

A DAS annotation server must contain a CAPABILITY element of type
"types".  A client may use its query URL to fetch a document of
content-type "application/x-das-types+xml". The document lists all of
the types available on the server.  We expect that servers will have
at most a few dozen types so DAS does not support type filters.

The following is a hypothetical example of a DAS annotation server
providing GENSCAN gene predictions for zebrafish.  Each feature is
either of type
"http://www.example.org/das/zebrafish/build19/high-type" or
"http://www.example.org/das/zebrafish/build19/low-type" depending on
if the data provider determined it was a high probability or low
probability prediction.  Even though there are two different type
records they refer to the same ontology term, in this case the SO term
for "gene".  The distinction exists so that the high probability
features are depicted differently from the low probability features.

Request:

http://www.example.org/das/zebrafish/build19/types

Response:

Content-Type: application/x-das-types+xml

<TYPES xmlns="http://www.biodas.org/ns/das/genome/2.00"
       xml:base="http://www.example.org/das/zebrafish/build19/">
  <TYPE id="high-type" title="High probability gene predictions"
      doc_href="http://www.example.org/docs/genscan_prediction.html#high"
      source="GENSCAN 1.0"
      ontology="http://song.sourceforge.net/XXX/does/not/exist/SO/0000704"
      accession="SO:0000704">
    <STYLE>
      <BOX fgcolor="red" border_width="1"/>
    </STYLE>
  </TYPE>
  <TYPE id="low-type" title="Low probability gene predictions"
      doc_href="http://www.example.org/docs/genscan_prediction.html#low"
      source="GENSCAN 1.0"
      ontology="http://song.sourceforge.net/XXX/does/not/exist/SO/0000704"
      accession="SO:0000704">
    <STYLE>
      <BOX fgcolor="yellow" border_width="1"/>
    </STYLE>
  </TYPE>

</TYPES>     

=====  Formats and extensibility

A DAS server may support additional formats to the ones defined in
this specification.  For example, client and server developers may
decide to use a more compact features representation, for better
performance.  The server should list the available formats in the
CAPABILITY section of the SOURCES document.

For example, the following says that the server implements three
formats.  The format name "das2xml" is reserved for the formats
defined by this specification.  The other two format names are
hypothetical:

  <CAPABILITY type="features" query_id="http://example.com/das/features.xml">
    <FORMAT name="das2xml" />
    <FORMAT name="das3xml" />
    <FORMAT name="compact-binary" />
  </CAPABILITY>

To request an alternate format the client must add a "format=<name>"
field to the query string of the URL.  For example, to request all of
the features from the above example but in "das3xml" the client makes
a request for:

  http://example.com/das/features.xml?format=das3xml

while to get all the features on Chr3 in "compact-binary" format the
client makes a request for

  http://example.com/das/features.xml?inside=Chr3;format=compact-binary
  
Servers may extend the features filter language to add new
capabilities as long as those terms do not affect queries without
those fields.  A server may list support for a query extension using
the SUPPORTS tag.  In the following the server says it supports the
"curation-search" extension as well as the das2xml, das3xml and
compact-binary formats:

  <CAPABILITY type="features" query_id="http://example.com/das/features.xml">
    <SUPPORTS name="curation-search" />
    <FORMAT name="das2xml" />
    <FORMAT name="das3xml" />
    <FORMAT name="compact-binary" />
  </CAPABILITY>

The client implementer must use some other means to discover what
additional filters are available for a "curation-search".

A server may support additional capabilities not defined by this
specification and list support for it through a new CAPABILITY item.
For example, in the following the hypothetical server implements an
alternative query language based on XQuery.

  <CAPABILITY type="xquery-features" query_id="http://example.com/das-search" />

The contents of the non-DAS2 CAPABILITY elements are determined by the
server implementer and a client implementer must look elsewhere to
discover what they mean.

=====  Details

This specification makes extensive use of URLs (URI? IRIs?).  While
non-HTTP URLs are possible the exchange protocol uses concepts like
request action and headers, response code and headers, and query
string construction which only make sense in the context of HTTP and
related protocols.

=== Response code

All servers must reply with the appropriate HTTP status code and
clients must react accordingly.


=== Content-Type header

Each of the five new formats has its own MIME type.  These are

application/x-das-sources+xml
application/x-das-features+xml
application/x-das-types+xml
application/x-das-segments+xml
application/x-das-errors+xml

A server should include the correct MIME type in the Content-Type
header of the response.  If not it must respond with "application/xml"
and must not respond with text/xml.  Character encoding is determined
as per RFC 3023.  We recommend that server implementers either not
include the charset parameter in the Content-Type header or ensure
that it is identical to the encoding in the document's XML
declaration.
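
A client-side check of these rules might look like the following
Python sketch.  It uses the standard library header parser; the
function name classify_content_type and the three category labels are
made up for this illustration.

```python
from email.message import Message

DAS_MIME_TYPES = {
    "application/x-das-sources+xml",
    "application/x-das-features+xml",
    "application/x-das-types+xml",
    "application/x-das-segments+xml",
    "application/x-das-errors+xml",
}


def classify_content_type(header_value):
    """Return (kind, version) for a Content-Type header value.

    kind is "das" for one of the five DAS MIME types, "generic-xml" for
    the permitted application/xml fallback, and "unexpected" for
    anything else, including text/xml.  version is the value of the
    "version" parameter, or None when the parameter is absent.
    """
    message = Message()
    message["Content-Type"] = header_value
    mime_type = message.get_content_type()
    version = message.get_param("version")
    if mime_type in DAS_MIME_TYPES:
        return "das", version
    if mime_type == "application/xml":
        return "generic-xml", version
    return "unexpected", version


print(classify_content_type("application/x-das-types+xml"))
print(classify_content_type("text/xml"))
```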

For use during specification development a server may include a
"version" value so clients can determine which version of the spec is
implemented by the server.  Unless others can convince me otherwise
this will be removed in the final specification.

Example:

  Content-Type: application/x-das-types+xml; version=300

The list of versions is as follows:

  100 - the version as of 2006/02/07

  200 - the version as of 2006/02/10
     (changed the feature query language format)
     (using "prop-" instead of "att" for property searches)

  300 - the version as of 2006/03/10, which includes
       the updates from the first sprint.

If not present the client may assume the format is in the most recent
version.

==== Segment Locations

Segment locations are used in three places in DAS: feature locations,
range-based feature filters, and sequence retrieval.

Every location is on a segment, referred to by name.  Unlike every
other item in DAS this name is not a URL.  It comes from the "name"
attribute of the SEGMENT element in the SEGMENTS document.  If two
reference servers serve the same coordinate system then the core
segment data -- segment name, size, and sequence data -- must be
identical.  A client may get additional information from any
equivalent reference server or use other means based on knowledge of
the coordinate system.

The residue location in a segment is given as an offset from the first
position, which has a location of 0.  The second position is 1, the
third is 2, and so on.  Segment ranges are given by start and end
positions.  The interval is half-open meaning that the interval from
'start' to 'end' includes the residues from position 'start' up to but
not including position 'end'.  For example, the range (3,6) includes
the residues at positions 3, 4 and 5 but not those at positions 2 or
6.  This scheme is sometimes referred to as "interbase coordinates".

The end coordinate of a range is never less than the start position.
The range (5,6) covers the residue at position 5 while (5,5) has size
of zero and refers to the point between positions 4 and 5.  Cleavage
site annotations may use zero size annotations like the latter.
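
Half-open, zero-based ranges behave the same way as Python slices, so
the rules above can be checked directly.  The sequence and the helper
name residues below are invented for illustration.

```python
def residues(sequence, start, end):
    """Return the residues covered by the half-open range (start, end)."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("invalid range (%d,%d)" % (start, end))
    return sequence[start:end]


seq = "ACGTACGTAC"

print(residues(seq, 3, 6))  # positions 3, 4 and 5
print(residues(seq, 5, 6))  # the single residue at position 5
print(residues(seq, 5, 5))  # zero-size range: no residues
```

The number of residues in a range is always end - start.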


Features may be located on a strand.  XXX I forgot what we said here;
1 for positive, -1 for negative, 0 for unknown and not given for both?

Feature locations are given in a shorthand notation.  The segment name
is required.  If only the segment name is given then the feature
location is on the entire segment.

The range is optional.  If present it occurs after the segment name
and the short-hand notation is in the form:

     SegmentName + "/" + start + ":" + end

The strand is optional.  If not given then the feature is on both
strands.  If present then the range must also be present.  The
short-hand notation with the strand identifier is

     SegmentName + "/" + start + ":" + end + ":" + strand

Here are some examples of the feature location used in a <LOC> element.

   <LOC segment="Chr1" />       -- all residues of Chr1
   <LOC segment="Chr2/0:2" />   -- the 1st and 2nd residues of Chr2
   <LOC segment="Chr3/20:20" /> -- the site between the 20th and 21st
                                     residues of Chr3 (positions 19 and 20)
   <LOC segment="Chr1/0:245522847:-1" />  -- the negative strand of Chr1
                                    (assuming a length of 245522847)
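
The short-hand notation is simple to parse.  The Python below is a
sketch under two assumptions that this text does not settle: segment
names contain no '/' character, and the strand is an integer whose
allowed values are not validated here.  The names Location and
parse_location are hypothetical.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    segment: str
    start: Optional[int]   # None when the whole segment is meant
    end: Optional[int]
    strand: Optional[int]  # None when no strand was given


def parse_location(text):
    """Parse 'Segment', 'Segment/start:end' or 'Segment/start:end:strand'."""
    segment, slash, rest = text.partition("/")
    if not segment:
        raise ValueError("segment name is required: %r" % text)
    if not slash:
        return Location(segment, None, None, None)
    fields = rest.split(":")
    if len(fields) not in (2, 3):
        raise ValueError("expected start:end or start:end:strand: %r" % text)
    start, end = int(fields[0]), int(fields[1])
    if start < 0 or end < start:
        raise ValueError("end must not be less than start: %r" % text)
    strand = int(fields[2]) if len(fields) == 3 else None
    return Location(segment, start, end, strand)


print(parse_location("Chr1"))
print(parse_location("Chr2/0:2"))
print(parse_location("Chr1/0:245522847:-1"))
```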


The feature filters use the same short-hand notation except without
the strand identifier.  Clients that want features on a specific
strand must post-process the returned list of features.  Clients that
want the sequence for the negative strand must compute the reverse
complement of the forward strand.

The forward slash ('/') and colon (':') characters have special
meaning in URLs so should be URL-escaped.  Here are some example query
URLs:

All features on Chr1 or Chr2
  http://www.biodas.org/das2/h.sapiens/v37/features?overlaps=Chr1;overlaps=Chr2

All features that overlap residues 200-300 of Chr1  ("Chr1/200:300")
  http://www.biodas.org/das2/h.sapiens/v37/features?overlaps=Chr1%2F200%3A300


Sequence retrieval queries work directly on the segment id so do not
need the segment name.  The range is passed using the "range" key of
the query, as in the following:

The sequence for Contig4392
  http://www.biodas.org/das2/h.sapiens/v37/sequence/Contig4392

The sequence for the first 10 residues of Contig4392 ("range=0:10")
  http://www.biodas.org/das2/h.sapiens/v37/sequence/Contig4392?range=0%3A10

The same sequence but in "raw" format
  http://www.biodas.org/das2/h.sapiens/v37/sequence/Contig4392?range=0%3A10;format=raw


==== ISO Dates

Several elements have 'created' and 'modified' attributes.  These dates
are formatted in a subset of ISO 8601. http://www.w3.org/TR/NOTE-datetime
Data providers must write the date using one of the following forms

  * Complete date:
       YYYY-MM-DD (e.g. 1997-07-16)
  * Complete date plus hours and minutes: 
       YYYY-MM-DDThh:mmTZD (e.g. 1997-07-16T19:20+01:00)
  * Complete date plus hours, minutes and seconds:
       YYYY-MM-DDThh:mm:ssTZD (e.g. 1997-07-16T19:20:30+01:00)

      where:

      YYYY = four-digit year
      MM   = two-digit month (01=January, etc.)
      DD   = two-digit day of month (01 through 31)
      hh   = two digits of hour (00 through 23) (am/pm NOT allowed)
      mm   = two digits of minute (00 through 59)
      ss   = two digits of second (00 through 59)
      TZD  = time zone designator (optional; one of the formats
                     "Z", +hh:mm, +hhmm, -hh:mm, or -hhmm)

If the timezone designator is not specified a parser may assume 'Z'.
If the seconds are not specified then a parser may assume 0.  If
the time is not specified then a parser may assume 12:00:00Z.

Here are some examples of valid dates

   1970-08-22

   2005-06-30T13:08
   1999-09-19T17:30Z
   1995-12-25T07:00-07:00
   1959-12-21T09:35+0300

   2000-01-01T01:23:45
   2009-04-15T23:02:31Z
   2001-10-22T21:39:12+01:00
   2042-03-18T01:19:00-11:15
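
A parser for the three date forms, applying the defaults described
above, could be sketched as follows.  This is an illustration rather
than a normative grammar; the name parse_das_date is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

_DATE_RE = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})"            # YYYY-MM-DD
    r"(?:T(\d{2}):(\d{2})(?::(\d{2}))?"    # Thh:mm with optional :ss
    r"(Z|[+-]\d{2}:?\d{2})?)?$"            # optional time zone designator
)


def parse_das_date(text):
    """Parse a DAS date into a timezone-aware datetime."""
    match = _DATE_RE.match(text)
    if match is None:
        raise ValueError("not a DAS date: %r" % text)
    year, month, day, hour, minute, second, tzd = match.groups()
    if hour is None:
        # Date only: a parser may assume 12:00:00Z.
        hour, minute, second = 12, 0, 0
    elif second is None:
        second = 0
    if tzd is None or tzd == "Z":
        tz = timezone.utc
    else:
        sign = -1 if tzd[0] == "-" else 1
        digits = tzd[1:].replace(":", "")
        offset = timedelta(hours=int(digits[:2]), minutes=int(digits[2:]))
        tz = timezone(sign * offset)
    # datetime() rejects out-of-range months, days, hours and so on.
    return datetime(int(year), int(month), int(day),
                    int(hour), int(minute), int(second), tzinfo=tz)


print(parse_das_date("1970-08-22"))
print(parse_das_date("1995-12-25T07:00-07:00"))
```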

====  SOURCES (detailed)

A sources request is a request for information about the data sets
available from a DAS server.  This may be a list of all data sources,
a list of all versions of a given data source, or information about a
specific version.  All three are done by fetching a sources document
given a URL.  The returned format is identical for all three cases,
except that some portions will require one element instead of a list of
zero or more elements.

The sources request does not use query parameters.  A future version
of DAS may, so servers should respond with an HTTP error code ("400 Bad
Request") if any are given.

The response document looks like the following:

Response:

Content-Type: application/x-das-sources+xml

<?xml version="1.0" encoding="UTF-8"?>
<SOURCES
    xmlns="http://www.biodas.org/ns/das/genome/2.00"
    xml:base="http://dev.wormbase.org/das/genome/">

  <MAINTAINER
    name="Yoyodyne DNA Systems"
    email="yoyodyna@example.com"
    href="http://www.example.com/" />

  <SOURCE id="volvox" title="Volvox Database" writeable="no"
      doc_href="http://www.example.org/volvox_db.pdf" taxid="3066">

    <VERSION id="volvox/build_1" title="Build 1, October 2002"
           created="2002-10-15" modified="2002-10-25T09:56:23">

      <MAINTAINER
        name="Volvox helpdesk"
        email="volvox-help@example.com" />
      <COORDINATES taxid="3066" source="chromosome" authority="NCBI">
         <VERSION name="35" />
      </COORDINATES>

      <COORDINATES taxid="2034" source="clone" authority="EMBL" />

      <CAPABILITY type="segments" query_id="volvox/1/segments">
          <FORMAT name="fasta" mimetype="text/x-fasta" />
          <FORMAT name="raw" mimetype="text/plain" />
      </CAPABILITY>
      <CAPABILITY type="types" query_id="volvox/1/types">
          <FORMAT name="das2xml" mimetype="application/x-das-types+xml" />
      </CAPABILITY>
      <CAPABILITY type="features" query_id="volvox/1/features">
          <FORMAT name="das2xml" mimetype="application/x-das-features+xml" />
      </CAPABILITY>
      <CAPABILITY type="locks" query_id="volvox/1/locks" />

    </VERSION>
  </SOURCE>
</SOURCES>

The MAINTAINER element is optional.  A server should provide at least
one of the 'name', 'email' or 'href' attributes.  The 'name' is short
human-readable text, the 'email' is an email address and the 'href' is
a URL meant for a human using a web browser.

The SOURCES element has zero or more SOURCE elements.  The 'id' is a
URL.  Each SOURCE must have a unique id.  A request on the SOURCE id
should be fetchable and respond with a sources document describing the
given data source.  The 'title' is a short label describing the source
to people.  The optional 'writeable' attribute is either 'yes' or
'no'.  The default is 'no'.  If 'yes' then the server supports
curational writeback. (XXX are we going to do it this way?)

The optional 'doc_href' attribute is a URL to more detailed information
about the source.  The optional 'taxid' is the NCBI taxon id for the
species.  (XXX isn't this redundant with the COORDINATES element?)

A SOURCE element has zero or more VERSION elements.  The definition of
what constitutes a new version is left to the data provider.  The
VERSION 'id' is a URL, which must be unique between the VERSION
elements in a SOURCE.  A request on the VERSION id should be fetchable
and respond with a sources document describing the specific version of
the data source.

The optional 'title' attribute of a VERSION element is a short label
describing the source to people.  The required 'created' attribute
states when the version was created and is an ISO timestamp.  The
optional 'modified' states when the version was most recently
modified.  If the modified attribute is not present then a client may
assume it has the same value as 'created'.

Each VERSION element may contain an optional MAINTAINER element, which
has the same syntax and meaning as the MAINTAINER element at the
SOURCES level.  It contains the contact information for the maintainer
of the specific data source, which may be different than the
maintainer for the server.  If the VERSION MAINTAINER is not present,
clients should use the SOURCES MAINTAINER instead for contact
information.
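
The fallback from the VERSION MAINTAINER to the SOURCES MAINTAINER
can be sketched like this.  The document is a cut-down variant of the
example above with a second, invented version ("volvox/build_2") that
has no MAINTAINER of its own; the function name maintainer_for is
hypothetical.

```python
import xml.etree.ElementTree as ET

DAS_NS = "{http://www.biodas.org/ns/das/genome/2.00}"

SOURCES_XML = """\
<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00">
  <MAINTAINER name="Yoyodyne DNA Systems" href="http://www.example.com/" />
  <SOURCE id="volvox" title="Volvox Database">
    <VERSION id="volvox/build_1" created="2002-10-15">
      <MAINTAINER name="Volvox helpdesk" email="volvox-help@example.com" />
    </VERSION>
    <VERSION id="volvox/build_2" created="2003-01-10" />
  </SOURCE>
</SOURCES>
"""


def maintainer_for(sources_xml, version_id):
    """Return the MAINTAINER attributes that apply to one version.

    Returns None when neither the VERSION nor the SOURCES element has a
    MAINTAINER.  Raises KeyError when the version id is not listed.
    """
    root = ET.fromstring(sources_xml)
    fallback = root.find(DAS_NS + "MAINTAINER")
    for source in root.findall(DAS_NS + "SOURCE"):
        for version in source.findall(DAS_NS + "VERSION"):
            if version.get("id") != version_id:
                continue
            own = version.find(DAS_NS + "MAINTAINER")
            chosen = own if own is not None else fallback
            return dict(chosen.attrib) if chosen is not None else None
    raise KeyError(version_id)


print(maintainer_for(SOURCES_XML, "volvox/build_1"))
print(maintainer_for(SOURCES_XML, "volvox/build_2"))
```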

The optional COORDINATES elements, if present, fully characterize the
reference sequence.  If two annotation servers have the same
COORDINATES element, meaning the same 'authority' and 'source'
values, then they are annotations on the same reference sequence.
The 'authority' attribute is the name of the organization that
determined the coordinate system.  It is a name like 'NCBI', 'EMBL',
'Ensembl', 'HUGO_ID', 'IPI' or 'UniProt'.  The 'source' attribute
refers to the "physical dimension" of the coordinate system.  It is a
name like 'Chromosome', 'Clone', 'Contig', 'Gene_ID', 'NT_Contig',
'Protein Sequence', 'Protein Structure', or 'Scaffold'.  If the
optional 'taxid' attribute is present it is the NCBI taxonomy id of
the organism.
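
The comparison rule above could be sketched as follows.  The function
name and the dict representation of a COORDINATES element are
hypothetical, and the sketch assumes an exact, case-sensitive string
comparison, which this specification does not state.

```python
def same_reference(coords_a, coords_b):
    """True if two COORDINATES attribute dicts name the same reference.

    Only 'authority' and 'source' take part in the comparison; other
    attributes such as 'taxid' or 'test_range' are ignored.
    """
    key_a = (coords_a["authority"], coords_a["source"])
    key_b = (coords_b["authority"], coords_b["source"])
    return key_a == key_b
```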

The Sanger Institute maintains a registry of authority and source
values at http://XXX.

The COORDINATES tag contains an optional 'test_range' attribute used
to test that the server is operational.  Experience with DAS1 found
that the web interface code often did not catch errors at the database
interface layer and would return empty results instead of correctly
reporting errors.  The test_range attribute is a value that can be
used in an 'inside' features filter.  The response after doing that
feature filter request must contain at least one feature.

There may be more than one COORDINATES element if ... (XXX why?)

The CAPABILITY elements describe what sort of queries a client may do
with the versioned source data.  The query is done through the URL
listed in the 'query_id' field.  Different query URLs support
different query interfaces.  The specific interface is listed in the
'type' field.  The specification defines the following query URL types:


    'type' value     for queries on
    ------------     --------------
     segments        the sequence data for the largest contiguous 
                       components in the data source
     types           the feature types
     features        the features
     locks           the locks, for writeback (define here or in
                           a sister "writeback spec"?)

A given type may not be used more than once. (XXX why not have more
than one "segments"?)

Relative 'query_id's are resolved according to the current xml:base.

A CAPABILITY has zero or more FORMAT elements, each with a 'name'
attribute.  These list the supported formats for the given capability.
To get the document in a given format, use the format's name in the
"format" parameter of the query.

This specification defines a standard set of format names.  For
details see the corresponding section.  Clients and servers may
support additional formats.
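
A client-side lookup of a capability might look like the sketch below.
It is not a definitive implementation: the list-of-dicts representation
of the CAPABILITY elements, the function name capability_url, and the
example 'query_id' values are assumptions made for illustration.  The
relative 'query_id' is resolved against xml:base with the standard
library's urljoin.

```python
from urllib.parse import quote, urljoin


def capability_url(xml_base, capabilities, cap_type, fmt=None):
    """Return the absolute query URL for the capability of the given type.

    capabilities is a list of dicts with 'type', 'query_id' and
    'formats' keys (a hypothetical in-memory form of CAPABILITY).
    """
    for cap in capabilities:
        if cap["type"] != cap_type:
            continue
        # Relative query_ids are resolved according to xml:base.
        url = urljoin(xml_base, cap["query_id"])
        if fmt is not None:
            if fmt not in cap["formats"]:
                raise ValueError("format not listed for this capability: " + fmt)
            url += "?format=" + quote(fmt, safe="")
        return url
    raise KeyError(cap_type)


# Hypothetical values, loosely modeled on the volvox examples.
base = "http://www.biodas.org/das2/sequence/volvox/v3/"
caps = [
    {"type": "segments", "query_id": "segments.xml", "formats": ["das2xml"]},
    {"type": "features", "query_id": "feature", "formats": ["das2xml"]},
]
segments_url = capability_url(base, caps, "segments")
```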


(XXX I earlier proposed a key/value table at the versioned element
level.  No one has used it or suggested it for anything.  I now
withdraw it.)

==== SEGMENTS (detailed)

Each reference sequence contains a set of segments.  A segment is the
largest chunk of contiguous sequence available.  For sequenced
organisms each chromosome will be a segment.  For partially assembled
genomes where the distance between assembled ranges is not known then
each partial fragment will have its own segment.

To get a list of all segments for a given data source, start with the
versioned source record and find the <CAPABILITY> element which has a
"type" of "segments".  The 'query_id' attribute is a URL.  Fetching
that URL returns a document with format name "das2xml" and
content-type "application/x-das-segments+xml".

Request:

http://www.biodas.org/das2/sequence/volvox/v3/segments.xml

Response:

Content-Type: application/x-das-segments+xml
<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://www.biodas.org/ns/das/genome/2.00">
 <SEGMENT id="http://www.biodas.org/das2/sequence/volvox/v3/segment/Chr1"
     name="Chr1" length="12345" />
 <SEGMENT id="http://www.biodas.org/das2/sequence/volvox/v3/segment/Chr2"
     name="Chr2" length="77777" />
</SEGMENTS>


There are zero or more <SEGMENT> elements under the <SEGMENTS> root.
Each segment has three attributes.  The 'id' is the URL for the given
segment.  The name attribute is a short word used in the feature query
when specifying a segment-specific range.  It must match the regular
expression pattern /[a-zA-Z_][a-zA-Z0-9_]*/ .  The length attribute is
an integer which is the total number of residues in the segment.
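
The following sketch parses the example segments document with
Python's standard ElementTree and checks each name against the pattern.
The function name parse_segments is illustrative, and the sketch assumes
the intended character class for the first character is [a-zA-Z_].

```python
import re
import xml.etree.ElementTree as ET

NS = "{http://www.biodas.org/ns/das/genome/2.00}"
NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://www.biodas.org/ns/das/genome/2.00">
 <SEGMENT id="http://www.biodas.org/das2/sequence/volvox/v3/segment/Chr1"
     name="Chr1" length="12345" />
 <SEGMENT id="http://www.biodas.org/das2/sequence/volvox/v3/segment/Chr2"
     name="Chr2" length="77777" />
</SEGMENTS>"""


def parse_segments(xml_bytes):
    """Map each segment name to its (id URL, length) pair."""
    root = ET.fromstring(xml_bytes)
    segments = {}
    for seg in root.findall(NS + "SEGMENT"):
        name = seg.get("name")
        # The whole name must match the pattern, not just a prefix.
        if name is None or not NAME_RE.fullmatch(name):
            raise ValueError("invalid segment name: %r" % (name,))
        segments[name] = (seg.get("id"), int(seg.get("length")))
    return segments


segments = parse_segments(DOC)
```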


XXX need to have information here about the supported formats.  See
the DAS mailing list thread "format information for the reference
server". 

== Segments query parameters

The segments query URL and the url for each segment support
form-urlencoded query parameters.  The optional 'format' query
specifies the data to return.  The default is 'das2xml' which is the
standard application/x-das-segments+xml document.

A DAS server must implement the "raw" and "fasta" format names.  The
raw format contains only the sequence data, with no record headers and
with the text folded every 78 characters or less.  The content-type
for raw sequence data should be "text/plain".

If partial or complete genomic assembly is available for a segment it
can be retrieved by requesting "agp" format.

The FASTA records should have a title containing the segment name.  A
client may ignore the title.  The content-type of the FASTA response
should be "text/x-fasta".


The individual segment URLs also support the 'range' parameter, which
returns a portion of the sequence instead of the entire sequence.  The
range value is of the form "$start:$end".  The start and end
coordinates are non-negative integers as described earlier.  The colon
(":") must be escaped for use in a URL.

For example, the following retrieves all the sequence data for Chr2
in 'raw' format:

  http://www.biodas.org/h.sapiens/v22/Chr2?format=raw

and the following retrieves the 400 residues starting from the 501st
position, in FASTA format:

  http://www.biodas.org/h.sapiens/v22/Chr2?range=500%3A900;format=fasta
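
A client could build such URLs with a small helper like the one
sketched here.  The name build_segment_url is hypothetical.  The sketch
joins parameters with ";" to match the examples above (the standard
library's urlencode joins with "&" by default) and percent-escapes every
reserved character, including the colon.

```python
from urllib.parse import quote


def build_segment_url(segment_id, start=None, end=None, fmt=None):
    """Append optional 'range' and 'format' parameters to a segment URL."""
    params = []
    if start is not None and end is not None:
        params.append(("range", "%d:%d" % (start, end)))
    if fmt is not None:
        params.append(("format", fmt))
    if not params:
        return segment_id
    # safe="" forces ":" to be escaped as %3A.
    query = ";".join(
        quote(key, safe="") + "=" + quote(value, safe="")
        for key, value in params
    )
    return segment_id + "?" + query


url = build_segment_url("http://www.biodas.org/h.sapiens/v22/Chr2", 500, 900, "fasta")
```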

If the server does not support a requested format name then it must
respond with an HTTP error 400 "Bad Request".  If the requested
sequence is too large then it must respond with an HTTP error 413
"Request Entity Too Large" message.  A server should be able to supply
ranges of at least 1 megabase (XXX or smaller?).

If the requested range extends beyond the segment limits then the
server must respond with an HTTP error 400 "Bad Request".  (XXX or
send a response truncated to the available limits?)
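
The error rules above can be summarized in a sketch of the server-side
check.  This follows the "must respond with 400" reading for
out-of-bounds ranges rather than the truncation alternative raised in
the XXX note.  The function name, the default list of supported formats,
and the 1 megabase limit are assumptions, as is the treatment of a
range as covering the positions from start up to but not including end.

```python
MAX_RANGE = 1_000_000  # hypothetical server limit, in residues


def range_status(start, end, segment_length, fmt,
                 supported=("das2xml", "raw", "fasta"),
                 max_range=MAX_RANGE):
    """Return the HTTP status code a server might send for a range request."""
    if fmt not in supported:
        return 400  # Bad Request: unsupported format name
    if start < 0 or end < start or end > segment_length:
        return 400  # Bad Request: range extends beyond the segment
    if end - start > max_range:
        return 413  # Request Entity Too Large
    return 200
```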

==== FEATURES (detailed)

Each annotation server provides zero or more features.  Information
about the features is returned through the "features" CAPABILITY query
URL, which returns a document with format name "das2xml" and
content-type "application/x-das-features+xml".  Each feature has a
unique identifier, which is a URL.  A server should support direct
fetching of a feature identifier.  If supported, the default return
document must be in das2xml format.

Here is an example of a DAS feature document:

Request:

http://www.biodas.org/das2/sequence/volvox/v3/feature/hit12

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES
     xmlns="http://www.biodas.org/ns/das/genome/2.00"
     xml:base="http://www.biodas.org/das2/sequence/volvox/v3/">

 <FEATURE id="feature/hit12"
          type_id="type/est-alignment"
          name="EST alignment"
          created="2001-12-15T22:43:36"
          modified="2004-09-26T21:10:15"
          doc_href="http://www.biodas.org/notes/est_alignment_note341.html"
          >

   <LOC id="segment/Chr3" range="1201:1400:1" />
   <PART id="feature/hit12.hsp1" />
   <PART id="feature/hit12.hsp2" />
   <ALIGN target_id="feature/yk12391" range="200:299" />
   <STYLE fgcolor="black">
     <BOX border_width="2" />
     <LABEL font_family="sans-serif" />
   </STYLE>
   <PROP key="est2genomescore" value="180" />
 </FEATURE>
</FEATURES>

The FEATURE 'id' is the unique identifier for the given feature.  It
should be a fetchable URI.  All features have a feature type.  The
'type_id' is the unique identifier for the type, as described in the
types request.
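
As an illustration, a client might read the identifiers from a
features document as sketched below.  The sketch assumes that relative
'id' and 'type_id' values are resolved against xml:base in the same way
as relative 'query_id's.  The function name feature_ids and the
abbreviated document are for illustration only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NS = "{http://www.biodas.org/ns/das/genome/2.00}"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
     xml:base="http://www.biodas.org/das2/sequence/volvox/v3/">
 <FEATURE id="feature/hit12" type_id="type/est-alignment">
   <PART id="feature/hit12.hsp1" />
   <PART id="feature/hit12.hsp2" />
 </FEATURE>
</FEATURES>"""


def feature_ids(xml_bytes):
    """Return absolute id, type_id, parent and part URLs for each FEATURE."""
    root = ET.fromstring(xml_bytes)
    base = root.get(XML_BASE, "")
    records = []
    for feat in root.findall(NS + "FEATURE"):
        records.append({
            "id": urljoin(base, feat.get("id")),
            "type_id": urljoin(base, feat.get("type_id")),
            "parents": [urljoin(base, p.get("id"))
                        for p in feat.findall(NS + "PARENT")],
            "parts": [urljoin(base, p.get("id"))
                      for p in feat.findall(NS + "PART")],
        })
    return records


records = feature_ids(DOC)
```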

The 'name' (XXX Should that be "title" or "description") is used for
.. what is it used for?  The "created" and "modified" attributes
contain ISO timestamps for when the feature annotation was first
created and most recently modified.  The 'doc_href' links to external
human readable documentation.

XXX we said this should contain a description of the link.
XXX Look at xlink?

The XID element says that ... XXX you know, I'm not sure what it does.
Everything I can think of suggests there should be more attributes to
the XID element.  Eg, the type of relationship (is-a, has-a).  Here's
what the current spec says

   Indicates that the feature corresponds to a DAS feature located on
   another data source (either local or remote). It is used to add
   annotations to a feature located on a remote server. (This type of
   functionality is sometimes called "gene DAS") A typical feature
   will either have a single <LOC> tag or a single <XID> tag, although
   it is possible (and sensible) to have one or more of both. Note
   that even though a feature has an <XID> tag, it is known to the
   local server by its internal id attribute given in the <FEATURE>
   tag.


A feature may be located on a segment, on multiple segments, or even on
multiple regions of multiple segments.  The position information is
given by the LOC element.  Each LOC contains the single attribute
"segment" which is in one of three possible forms:

     $segment_name
     $segment_name/$start:$end
     $segment_name/$start:$end:$chain

For example,

   <LOC segment="ChrX" />    -- All of "ChrX"
   <LOC segment="ChrX/200:305" />  -- 105 residues of ChrX
   <LOC segment="ChrX/200:305:1" />  -- forward strand of those 105 residues
   <LOC segment="ChrX/200:305:-1" /> -- reverse strand of those 105 residues


The segment name is the name given in the segments document.  A client
may use the segments document to map the name to a URL.
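
The three forms of the "segment" attribute could be parsed as in the
sketch below.  The Location type and parse_loc function are
hypothetical names, and the sketch assumes that the only valid $chain
values are 1 and -1, as in the examples.

```python
import re
from typing import NamedTuple, Optional


class Location(NamedTuple):
    segment: str
    start: Optional[int]
    end: Optional[int]
    strand: Optional[int]


# $segment_name, optionally followed by /$start:$end and then :$chain
LOC_RE = re.compile(r"([^/]+)(?:/(\d+):(\d+)(?::(-?1))?)?")


def parse_loc(text):
    """Parse a LOC 'segment' attribute value into a Location."""
    match = LOC_RE.fullmatch(text)
    if match is None:
        raise ValueError("not a valid location: %r" % (text,))
    name, start, end, strand = match.groups()
    return Location(
        name,
        int(start) if start is not None else None,
        int(end) if end is not None else None,
        int(strand) if strand is not None else None,
    )


loc = parse_loc("ChrX/200:305:-1")
```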

The biological annotation may be quite complex, and not directly
represented as a single feature with a range.  For example, a
processed mRNA annotation may refer to two exon regions.  (XXX I am so
not a biologist; need a better example.)  In DAS these annotations are
modeled through a parent/part relationship.  A feature may have one or
more parents and have one or more parts.  (XXX why "part" and not
"child"?)

The parent ids are available in the feature record through the 'id'
attributes of the list of <PARENT> elements.  The part ids are the
'id' attributes of the list of <PART> elements.

XXX fill in details here of various ways to model biological
annotations through features.  Develop a best-practices doc?


All feature locations are given in coordinates on a segment.  Some
features have alternate locations.  For example, a feature may be
located on a contig.  Each alternate coordinate location is stored in a
<REGION> element, which has 'id' and 'range' attributes.  The 'id' is
the URL for the feature defining the alternate coordinate system.  The
'range' is a string of the form "$start:$end".  It defines the feature
location in the coordinate system of the referred-to feature.

For example, suppose feature A is 6 bases long and is on chromosome 5
at position 10000, on exon X at position 300 and on contig K at
position 7.  The FEATURE record for this feature may be as follows:

GET http://www.biodas.org/das2/sequence/volvox/v3/feature/A

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES  xmlns="http://www.biodas.org/ns/das/genome/2.00"
     xml:base="http://www.biodas.org/das2/sequence/volvox/v3/">
  <FEATURE id="feature/A" type_id="type/Type_A">

    <LOC id="segment/5" range="10000:10006" />

    <REGION id="feature/exon_X" range="300:306" />
    <REGION id="feature/contig_K" range="7:13" />
  </FEATURE>
</FEATURES>


A feature may have zero or more <ALIGN> elements.  Each describes an
alignment of the given feature with another DAS feature, which may be
on the same genome or a different one.  The 'target_id' attribute is
the URL for the aligned feature.  Use the optional 'range' attribute
if the alignment is to a part of the feature.

The optional 'gap' describes gaps that are present in the reference
and target strands of the alignment.  The gap data is stored as a
CIGAR-formatted string described in the Exonerate documentation.
http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html

Here is an example alignment  (XXX This has changed!)

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES  xmlns="http://www.biodas.org/ns/das/genome/2.00"
     xml:base="http://www.biodas.org/das2/sequence/volvox/v3/">
  <FEATURE id="feature/hit12.hsp2"
           type_id="type/est-alignment-hsp">
     <PARENT id="feature/hit12" />
     <LOC id="region/Chr3" range="1351:1400:1" />
     <ALIGN target_id="region/yk12391" range="53:100" gap="M20 D1 G1 M30" />
     <PROP  key="est2genomescore" value="120" />
  </FEATURE>
</FEATURES>
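
A client might split the 'gap' string into operations as sketched
here.  The sketch assumes the space-separated letter-plus-length token
form shown in the example above, and it does not interpret what each
operation letter means.  The function name parse_gap is illustrative.

```python
import re

TOKEN_RE = re.compile(r"([A-Za-z])(\d+)")


def parse_gap(gap):
    """Split a gap string such as 'M20 D1 G1 M30' into (op, length) pairs."""
    operations = []
    for token in gap.split():
        match = TOKEN_RE.fullmatch(token)
        if match is None:
            raise ValueError("unrecognized gap token: %r" % (token,))
        operations.append((match.group(1), int(match.group(2))))
    return operations


ops = parse_gap("M20 D1 G1 M30")
```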


A feature may have a <STYLE> element describing how to display the
feature.  If present this <STYLE> overrides the <STYLE> element from
the feature type.  See <some other section> for details on the STYLE
element.

A feature may have a list of <PROP> elements and a section of non-DAS
extension elements.  See <some other section> for details.

A feature record may be retrievable in multiple alternate formats.
The list of format names is available from the format type record.  To
get the record in a different format use the "format=" parameter in
the query string followed by the format name.

If the server does not support a requested format name then it must
respond with an HTTP error 400 "Bad Request".

== feature query URL and filters

With no query parameters the feature query URL should return all of
the features in das2xml format.  If there are too many features to
return, a server must instead respond with an HTTP error 413 "Request
Entity Too Large" so the client can search over a smaller range.

(XXX I don't fully understand the behaviour.  Suppose the server says
the max size is 5 features at a time.  How does a client figure out
the correct behaviour?)

An annotation server should implement the DAS2 query filter language
so a client can ask for a subset of the available features.  The query
language is based on a list of key/value pairs.  The search keys are:

  key       | takes  | matches features ...
 ======================================================================
  xid       | URL    | which have the given xid
  type      | URL    | with the given type or supertype
  exacttype | URL    | with exactly the given type
  overlaps  | region | which overlap the given region
  inside    | region | which are contained inside the given region
  contains  | region | which contain the given region
  identical | region | which exactly fit in the given region
  name      | string | with a name or alias which equals the given string
  prop-*    | string | with the property "*" containing the given substring
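
The four region filters can be read as interval predicates.  The
sketch below is one interpretation, not a normative definition.  It
assumes the feature and the region are on the same segment and that a
range covers the positions from start up to but not including end,
which is consistent with "ChrX/200:305" covering 105 residues.  The
function names mirror the filter keys.

```python
def overlaps(feature, region):
    """True if the two (start, end) ranges share at least one position."""
    return feature[0] < region[1] and region[0] < feature[1]


def inside(feature, region):
    """True if the feature lies entirely within the region."""
    return region[0] <= feature[0] and feature[1] <= region[1]


def contains(feature, region):
    """True if the feature entirely covers the region."""
    return feature[0] <= region[0] and region[1] <= feature[1]


def identical(feature, region):
    """True if the feature has exactly the same start and end as the region."""
    return tuple(feature) == tuple(region)


hit = (1201, 1400)
```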


Fields with the same key name are implemented as "or" searches.  For
example,

   name = CHCR
   name = AB077698

is the same as

   "has a name or alias of 'CHCR' or of 'AB077698'"

Fields with different key names are implemented as "and" searches.
For example,

   type = http://www.biodas.org/ontology/promoter
   inside = Chr3/0:1000

is the same as (assuming that the type URL refers to a promoter)

  "any promoter on the first 1,000 bases of Chr3"

The fields are form-urlencoded and appended to the feature query URL
as its query string.  If there are multiple query fields with the same
key then there will be multiple fields in the URL query string.

For example, given a query URL of http://www.biodas.org/features.cgi
then the filter queries for the previous two examples are

  http://www.biodas.org/features.cgi?name=CHCR;name=AB077698

  http://www.biodas.org/features.cgi?type=http%3A%2F%2Fwww.biodas.org%2Fontology%2Fpromoter;inside=Chr3%2F0%3A1000
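
A minimal sketch of how a client might build these filter URLs
follows.  The function name filter_url is hypothetical.  Filters are
passed as an ordered list of (key, value) pairs so that a key may be
repeated, and fields are joined with ";" to match the examples above.

```python
from urllib.parse import quote


def filter_url(query_url, filters):
    """Append form-urlencoded filter fields to a feature query URL."""
    if not filters:
        return query_url
    # safe="" escapes "/" and ":" as well, as in the examples.
    fields = [
        quote(key, safe="") + "=" + quote(value, safe="")
        for key, value in filters
    ]
    return query_url + "?" + ";".join(fields)


by_name = filter_url(
    "http://www.biodas.org/features.cgi",
    [("name", "CHCR"), ("name", "AB077698")],
)
by_type = filter_url(
    "http://www.biodas.org/features.cgi",
    [("type", "http://www.biodas.org/ontology/promoter"),
     ("inside", "Chr3/0:1000")],
)
```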


====

[coordinate-system]

We make a distinction between "coordinate system" and "numbering
system".  The coordinate system is the set of segments on which
features are located.  The numbering system describes how to identify
the specific residues in the segment.  DAS uses a 0-based coordinate
system where the first residue is numbered "0", the second "1", and so
on.  Other numbering systems include 1-based coordinates and the PDB
numbering system which preserves the residue number for key residues
across a homologous family by allowing discontinuities, insertions and
negative values as position numbers.
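
For clients that display 1-based positions, the conversion could be
sketched as follows.  It assumes a DAS range "$start:$end" covers the
positions from start up to but not including end, and that the 1-based
form includes both endpoints.  The function names are illustrative.

```python
def to_one_based(start, end):
    """Convert a 0-based DAS range to a 1-based inclusive range."""
    return start + 1, end


def to_zero_based(first, last):
    """Convert a 1-based inclusive range to a 0-based DAS range."""
    return first - 1, last


first, last = to_one_based(500, 900)
```

The length of the range is end - start in the 0-based form and
last - first + 1 in the 1-based form.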