DAS is a protocol for sharing biological data. This version of the specification, DAS 2.0, describes features located on the genomic sequence. Future versions will add support for sharing annotations of protein sequences, expression data, 3D structures and ontologies. The genomic DAS interface is deliberately designed so there will be a large core shared with the protein sequence DAS.

A DAS 2.0 annotation server provides feature information about one or more genome sources. Each source may have one or more versions. Different versions are usually based on different assemblies. As an implementation detail, an assembly and its corresponding sequence data may be distributed via a different machine, which is called the reference server.

Annotations are located on the genomic sequence with a start and end position. The range may be specified multiple times if there are alternate coordinate systems. An annotation may contain multiple non-contiguous parts, making it the parent of those parts. Some parts may have more than one parent. Annotations have a type based on terms in SOFA (Sequence Ontology for Feature Annotation). Stylesheets contain a set of properties used to depict a given type.

Annotations can be searched by range, type, and a properties table associated with each annotation. These are called feature filters.

DAS 2.0 is implemented using a ReST architecture. Each document (also called an entity or object) has a name, which is a URL. Fetching the URL gets information about the document. The DAS-specific documents are all in XML. Other data types have existing widely used formats, and sometimes more than one for the same data. A DAS server may provide a distinct document for each of these formats, along with information about which formats are available.

DAS 2.0 addresses some shortcomings of the DAS 1.x protocol, including:

* Better support for hierarchical structures (e.g.
transcript + exons)
* Ontology-based feature annotations
* Multiple formats, including formats only appropriate for some feature types
* A lock-based editing protocol for curational clients
* An extensible namespacing system that allows annotations in non-genomic coordinates (e.g. uniprot protein coordinates or PDB structure coordinates)

===== The SOURCES document (overview)

A DAS server supplies information about genomic sequence data sources. The collection of all sources, each data source, and each version of a data source are accessible through a URL. All three classes of URLs return a document of content-type 'application/x-das-sources+xml', though likely with differing amounts of detail.

A 'versioned source' request returns information only about a specific version of a data source. A 'source' request returns the list of all the versioned source data for that source. A 'sources' request returns the list of all the source data, including all the versioned source data.

The URLs might not be distinct. For example, a server with only one version of one data source may use the same URL for all three documents, and a server for a single organism may use the same URL for the 'sources' and 'source' documents.

Most servers will list only the data sources provided by that server. Some servers combine the sources documents from other servers into a single document. These registry servers act as a centralized index and reduce configuration and network overhead. A registry server uses the same sources format as an annotation server.

Here is an example of a simple sources document which makes no distinction between the three sources categories. Like the other new DAS formats, it is in XML. All of the DAS elements are in the XML namespace http://www.biodas.org/ns/das/genome/2.00 (XXX Is that still correct?). This namespace is reserved, and authors of DAS extensions may not create new elements in it.
Request:
  http://www.example.com/das/genome/yeast.xml

Response:
  Content-Type: application/x-das-sources+xml

All identifiers and href attributes in DAS documents follow the XML Base specification (see http://www.w3.org/TR/xmlbase/ ) in resolving partial identifiers and href attributes. In this case the relative id "yeast.xml" is fully resolved using the xml:base of "http://www.example.com/das/genome/" to "http://www.example.com/das/genome/yeast.xml". If the result after resolving through all the parent xml:base attributes is still a relative URL then it is resolved once more with respect to the URL used to fetch the document.

Here is an example of a more complicated sources document with multiple organisms, each with multiple versions. Each of the two source documents (one for each organism) has a distinct URL, as does each version of each organism. This is a pure registry server because the actual annotation data comes from other machines.

Request:
  http://www.biodas.org/known_das_servers

Response:
  Content-Type: application/x-das-sources+xml

Each SOURCE id and VERSION id is individually fetchable, so the URL "http://das.ensembl.org/das/SPICEDS/" returns a sources document with the SOURCE record for "das_vega_trans" and both of its VERSION subelements, while "http://das.ensembl.org/das/SPICEDS/128/" returns a sources document with only the second of its VERSION subelements.

DAS documents refer to other documents through URLs. There are no restrictions on the internal form of the URLs, other than the query string portion. Server implementers are free to choose URLs which best fit their architecture needs. For example, a simple DAS server may be implemented as a set of XML files hosted by a standard web server, while more complex servers with search support may be implemented as CGI scripts or through embedded web server extensions. The URLs do not need to define a hierarchical structure nor even be on the same machine.
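The XML Base resolution rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the function name, the list-of-bases argument, and the mirror URL used as the fetch location are assumptions made for the example.

```python
from urllib.parse import urljoin


def resolve_das_url(reference, xml_bases, document_url):
    """Resolve a possibly relative id or href attribute.

    xml_bases holds the xml:base values in scope for the element,
    outermost first (hypothetical calling convention). document_url is
    the URL that was used to fetch the document.
    """
    base = document_url
    for xml_base in xml_bases:
        # An xml:base may itself be relative to the enclosing base.
        base = urljoin(base, xml_base)
    # The final join also covers the case with no xml:base at all, where
    # the reference is resolved against the fetch URL.
    return urljoin(base, reference)


# The relative id from the first example, under its xml:base.
print(resolve_das_url(
    "yeast.xml",
    ["http://www.example.com/das/genome/"],
    "http://mirror.example.org/cache/sources.xml",  # made-up fetch URL
))
```

An absolute xml:base replaces the fetch URL as the base, so a copy of the document served from another location still yields the original identifiers.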
Compare this to the DAS1 specification, where some URLs were constructed by direct string modification of other URLs.

===== The SEGMENTS document (overview)

Each versioned source contains a set of segments. A segment is the largest chunk of contiguous sequence. For fully sequenced organisms a segment may be a chromosome. For partially assembled genomes where the distance between the assembled regions is not known, each region may be its own segment. If a server provides annotations in contig space then each contig is a segment.

Feature locations are specified on ranges of segments, which is why a specific set of segments is called a coordinate system. [coordinate-system] This specification does not describe how to do alignments between different coordinate systems.

The sources document format has two ways to describe the coordinate system. The optional COORDINATES element uniquely characterizes the coordinate system. If two data sources have the same authority and source values then they must be annotations on the same coordinate system. The specific coordinate system is also called the "reference sequence".

A versioned source may contain CAPABILITY elements which describe different ways to request additional data from a DAS server. Each CAPABILITY has a type that describes how to use the corresponding URL to query a DAS server. A CAPABILITY element of type "segments" has a query URL which returns a document of content-type "application/x-das-segments+xml". A segments document lists information about the segments in the coordinate system.

Here is an example of a segments document.

Request:
  http://www.biodas.org/das2/h.sapiens/v3/segments.xml

Response:
  Content-Type: application/x-das-segments+xml

Note that unlike the previous examples this document defined the new namespace abbreviation "das" instead of defining a default namespace.
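As a rough sketch of how a client might find the query URL for a capability such as "segments", the following Python parses a sources document and collects the CAPABILITY entries of one version. The sample document is hand-written for illustration, the attribute names 'id', 'type' and 'query_id' follow the detailed SOURCES section of this specification, and the namespace URI is the one quoted earlier (which the text itself marks as unconfirmed). The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI as quoted in this specification (flagged there with an XXX).
DAS_NS = "http://www.biodas.org/ns/das/genome/2.00"

# Hand-written sample, not an official example. It uses a "das" prefix
# rather than a default namespace; the parser treats both the same way.
SAMPLE_SOURCES = """\
<das:SOURCES xmlns:das="http://www.biodas.org/ns/das/genome/2.00">
  <das:SOURCE id="http://www.example.org/das2/h.sapiens/" title="Human (sample)">
    <das:VERSION id="http://www.example.org/das2/h.sapiens/v3/" created="2006-03-10">
      <das:CAPABILITY type="segments"
                      query_id="http://www.example.org/das2/h.sapiens/v3/segments.xml"/>
      <das:CAPABILITY type="features"
                      query_id="http://www.example.org/das2/h.sapiens/v3/features"/>
    </das:VERSION>
  </das:SOURCE>
</das:SOURCES>
"""


def capability_urls(sources_xml, version_id):
    """Return {capability type: query URL} for the VERSION with this id."""
    root = ET.fromstring(sources_xml)
    # ElementTree expands prefixes to "{namespace}LocalName", so the
    # lookup does not depend on which prefix the document chose.
    for version in root.iter(f"{{{DAS_NS}}}VERSION"):
        if version.get("id") == version_id:
            return {
                cap.get("type"): cap.get("query_id")
                for cap in version.findall(f"{{{DAS_NS}}}CAPABILITY")
            }
    return {}


urls = capability_urls(SAMPLE_SOURCES, "http://www.example.org/das2/h.sapiens/v3/")
print(urls["segments"])
```

Matching on the expanded namespace name rather than on the literal prefix is what makes the prefix-versus-default-namespace difference noted above irrelevant to a client.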
===== The FEATURES document (overview)

The versioned source record for an annotation server must include a CAPABILITY of type "features". A client may use the query URL from the features CAPABILITY to select features which match certain criteria. If no criteria are specified the server must return all features unless there are too many features to return. In that case it must respond with an error message.

Unless an alternate format is specified, the response from the features query is a document of content-type "application/x-das-features+xml" containing all of the matching features. Here is an example features document for a server which contains a gene and an alignment.

Request:
  http://das.biopackages.net/das/genome/yeast/S228C/features.pl

Response:
  Content-Type: application/x-das-features+xml

Each feature has a unique identifier and an identifier linking it to a type record. Both identifiers are URLs and should be directly fetchable.

Simple features can be located on a region of a segment. More complex features like a gapped alignment are represented through a parent/part relationship. A feature may have multiple parents and multiple parts.

===== Feature Filters (overview)

An annotation server may contain many features while the client may only be interested in a subset; most likely features in a given portion of the reference sequence. To help minimize the bandwidth overhead the feature query URL should support the DAS feature filter language. The syntax uses the standard HTML form-urlencoded query syntax. For example, here is a request for all features on Chr2.
Request:
  http://www.example.org/volvox/1/features.cgi?inside=Chr2

Response:
  Content-Type: application/x-das-features+xml

and here is the rather long one for all EST alignments

Request:
  http://www.example.org/volvox/1/features.cgi?type=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Ftype%2Fest-alignment

Response:
  Content-Type: application/x-das-features+xml

===== The TYPES document (overview)

All features are linked to a type record. DAS types do not describe a formal type system, in that DAS types do not derive from other DAS types. Instead a DAS type links to an external ontology term and describes how to depict features of that type.

A DAS annotation server must contain a CAPABILITY element of type "types". A client may use its query URL to fetch a document of content-type "application/x-das-types+xml". The document lists all of the types available on the server. We expect that servers will have at most a few dozen types, so DAS does not support type filters.

The following is a hypothetical example of a DAS annotation server providing GENSCAN gene predictions for zebrafish. Each feature is either of type "http://www.example.org/das/zebrafish/build19/high-type" or "http://www.example.org/das/zebrafish/build19/low-type" depending on whether the data provider determined it was a high probability or low probability prediction. Even though there are two different type records they refer to the same ontology term, in this case the SO term for "gene". The distinction exists so that the high probability features are depicted differently from the low probability features.

Request:
  http://www.example.org/das/zebrafish/build19/types

Response:
  Content-Type: application/x-das-types+xml

===== Formats and extensibility

A DAS server may support additional formats to the ones defined in this specification. For example, client and server developers may decide to use a more compact features representation, for better performance.
The server should list the available formats in the CAPABILITY section of the SOURCES document. For example, the following says that the server implements three formats. The format name "das2xml" is reserved for the formats defined by this specification. The other two format names are hypothetical:

To request an alternate format the client must add a "format=" field to the query string of the URL. For example, to request all of the features from the above example but in "das3xml" the client makes a request for:

  http://example.com/das/features.xml?format=das3xml

while to get all the features on Chr3 in "compact-binary" format the client makes a request for

  http://example.com/das/features.xml?inside=Chr3;format=compact-binary

Servers may extend the features filter language to add new capabilities as long as those terms do not affect queries without those fields. A server may list support for a query extension using the SUPPORTS tag. In the following the server says it supports the "curation-search" as well as the das2xml and compact-binary formats:

The client implementer must use some other means to discover what additional filters are available for a "curation-search".

A server may support additional capabilities not defined by this specification and list support for them through a new CAPABILITY item. For example, in the following the hypothetical server implements an alternative query language based on XQuery.

The contents of the non-DAS2 CAPABILITY elements are determined by the server implementer, and a client implementer must look elsewhere to discover what they mean.

===== Details

This specification makes extensive use of URLs (URI? IRIs?). While non-HTTP URLs are possible, the exchange protocol uses concepts like request action and headers, response code and headers, and query string construction which only make sense in the context of HTTP and related protocols.
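The query strings used in this specification (filter fields plus an optional "format" field) could be assembled along the following lines. This is a sketch under assumptions: the helper name is made up, and the semicolon separator simply mirrors the example URLs in this document, whereas many form encoders emit "&" instead.

```python
from urllib.parse import quote


def build_query_url(query_url, fields):
    """Append percent-encoded key=value fields to a capability query URL.

    fields is a list of (key, value) pairs so that a key may be repeated.
    """
    pairs = [
        # safe="" forces reserved characters such as "/" and ":" to be escaped.
        f"{quote(key, safe='')}={quote(value, safe='')}"
        for key, value in fields
    ]
    if not pairs:
        return query_url
    return query_url + "?" + ";".join(pairs)


# Features on Chr3 in the hypothetical "compact-binary" format.
print(build_query_url(
    "http://example.com/das/features.xml",
    [("inside", "Chr3"), ("format", "compact-binary")],
))

# A type filter whose value is itself a URL.
print(build_query_url(
    "http://www.example.org/volvox/1/features.cgi",
    [("type", "http://www.example.org/volvox/1/type/est-alignment")],
))
```

Escaping every reserved character in the values keeps a URL-valued filter, such as a type id, from being confused with the structure of the query URL itself.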
=== Response code

All servers must reply with the appropriate HTTP status code and clients must react accordingly.

=== Content-Type header

Each of the five new formats has its own MIME type. These are

  application/x-das-sources+xml
  application/x-das-features+xml
  application/x-das-types+xml
  application/x-das-segments+xml
  application/x-das-errors+xml

A server should include the correct MIME type in the Content-Type header of the response. If not it must respond with "application/xml" and must not respond with text/xml.

Character encoding is determined as per RFC 3023. We recommend that server implementers either not include the charset parameter in the Content-Type header or ensure that it is identical to the encoding in the document's XML declaration.

For use during specification development a server may include a "version" value so clients can determine which version of the spec is implemented by the server. Unless others can convince me otherwise this will be removed in the final specification.

Example:
  Content-Type: application/x-das-types+xml; version=300

The list of versions is as follows:

  100 - the version as of 2006/02/07
  200 - the version as of 2006/02/10 (changed the feature query language format; using "prop-" instead of "att" for property searches)
  300 - the version as of 2006/03/10, which includes the updates from the first sprint.

If not present the client may assume the format is in the most recent version.

==== Segment Locations

Segment locations are used in three places in DAS: feature locations, range-based feature filters, and sequence retrieval.

Every location is on a segment, referred to by name. Unlike every other item in DAS this name is not a URL. It comes from the "name" attribute of the SEGMENT element in the SEGMENTS document. If two reference servers serve the same coordinate system then the core segment data -- segment name, size, and sequence data -- must be identical.
A client may get additional information from any equivalent reference server or use other means based on knowledge of the coordinate system.

The residue location in a segment is given as an offset from the first position, which has a location of 0. The second position is 1, the third is 2, and so on.

Segment ranges are given by start and end positions. The interval is half-open, meaning that the interval from 'start' to 'end' includes the residues from position 'start' up to but not including position 'end'. For example, the range (3,6) includes the residues at positions 3, 4 and 5 but not those at positions 2 or 6. This scheme is sometimes referred to as "interbase coordinates".

The end coordinate of a range is never less than the start position. The range (5,6) covers the residue at position 5 while (5,5) has a size of zero and refers to the point between positions 4 and 5. Cleavage site annotations may use zero size annotations like the latter.

Features may be located on a strand. XXX I forgot what we said here; 1 for positive, -1 for negative, 0 for unknown and not given for both?

Feature locations are given in a shorthand notation. The segment name is required. If only the segment name is given then the feature location is on the entire segment.

The range is optional. If present it occurs after the segment name and the short-hand notation is in the form:

  SegmentName + "/" + start + ":" + end

The strand is optional. If not given then the feature is on both strands. If present then the range must also be present. The short-hand notation with the strand identifier is

  SegmentName + "/" + start + ":" + end + ":" + strand

Here are some examples of the feature location use in a element.

  -- all residues of Chr1
  -- the 1st and 2nd residues of Chr2
  -- the site between the 19th and 20th residues of Chr3
  -- the negative strand of Chr1 (assuming a length of 245522847)

The feature filters use the same short-hand notation except without the strand identifier.
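The short-hand notation and the half-open interval rule lend themselves to a small parser. The following Python is a minimal sketch, not a normative grammar: the names Location and parse_location are invented here, and because the strand values are still an open question in this draft, the strand field is kept as the raw string rather than interpreted.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    segment: str
    start: Optional[int]   # None when the whole segment is meant
    end: Optional[int]
    strand: Optional[str]  # raw text; its allowed values are not settled


def parse_location(text):
    """Parse 'Name', 'Name/start:end' or 'Name/start:end:strand'."""
    segment, slash, rest = text.partition("/")
    if not segment:
        raise ValueError("the segment name is required")
    if not slash:
        return Location(segment, None, None, None)
    fields = rest.split(":")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected start:end or start:end:strand, got {rest!r}")
    start, end = int(fields[0]), int(fields[1])
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    strand = fields[2] if len(fields) == 3 else None
    return Location(segment, start, end, strand)


loc = parse_location("Chr1/3:6")
# Half-open interval: positions start .. end-1 are covered.
print(list(range(loc.start, loc.end)))  # [3, 4, 5]
print(loc.end - loc.start)              # 3 residues
```

With half-open coordinates the size of a range is simply end minus start, and a zero-size range such as "Chr3/19:19" needs no special case.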
Clients that want features on a specific strand must post-process the returned list of features. Clients that want the sequence for the negative strand must compute the reverse complement of the forward strand.

The forward slash ('/') and colon (':') characters have special meaning in URLs so should be URL-escaped. Here are some example query URLs:

All features on Chr1 or Chr2
  http://www.biodas.org/das2/h.sapiens/v37/features?overlaps=Chr1;overlaps=Chr2

All features that overlap residues 200-300 of Chr1 ("Chr1/200:300")
  http://www.biodas.org/das2/h.sapiens/v37/features?overlaps=Chr1%2F200%3A300

Sequence retrieval queries work directly on the segment id so do not need the segment name. The range is passed using the "range" key of the query, as in the following:

The sequence for Contig4392
  http://www.biodas.org/das2/h.sapiens/v37/sequence/Contig4392

The sequence for the first 10 residues of Contig4392 ("range=0:10")
  http://www.biodas.org/das2/h.sapiens/v37/sequence/Contig4392?range=0%3A10

The same sequence but in "raw" format
  http://www.biodas.org/das2/h.sapiens/v37/sequence/Contig4392?range=0%3A10;format=raw

== ISO Dates ===

Several elements have 'created' and 'modified' attributes. These dates are formatted in a subset of ISO 8601.

  http://www.w3.org/TR/NOTE-datetime

Data providers must write the date using one of the following forms

* Complete date: YYYY-MM-DD (e.g. 1997-07-16)
* Complete date plus hours and minutes: YYYY-MM-DDThh:mmTZD (e.g. 1997-07-16T19:20+01:00)
* Complete date plus hours, minutes and seconds: YYYY-MM-DDThh:mm:ssTZD (e.g. 1997-07-16T19:20:30+01:00)

where:
      YYYY = four-digit year
      MM   = two-digit month (01=January, etc.)
      DD   = two-digit day of month (01 through 31)
      hh   = two digits of hour (00 through 23) (am/pm NOT allowed)
      mm   = two digits of minute (00 through 59)
      ss   = two digits of second (00 through 59)
      TZD  = time zone designator (optional; one of the formats
                     "Z", +hh:mm, +hhmm, -hh:mm, or -hhmm)
      
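The three permitted date forms and the field ranges listed above can be checked with a regular expression plus a few range tests. The following Python is an illustrative sketch only; the names DAS_DATE and parse_das_date are made up, and the function returns the fields as written without applying any defaults.

```python
import re
from datetime import date

# YYYY-MM-DD, optionally followed by Thh:mm or Thh:mm:ss and an optional
# time zone designator ("Z", +hh:mm, +hhmm, -hh:mm or -hhmm).
DAS_DATE = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"(?:T(?P<hour>\d{2}):(?P<minute>\d{2})(?::(?P<second>\d{2}))?"
    r"(?P<tzd>Z|[+-]\d{2}:?\d{2})?)?"
)


def parse_das_date(text):
    """Return the date fields as a dict of strings (None where absent).

    Raises ValueError if the text does not use one of the allowed forms
    or if a field is out of range.
    """
    match = DAS_DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"not an allowed date form: {text!r}")
    parts = match.groupdict()
    # datetime.date rejects impossible calendar dates such as month 13.
    date(int(parts["year"]), int(parts["month"]), int(parts["day"]))
    limits = (("hour", 23), ("minute", 59), ("second", 59))
    for name, upper in limits:
        if parts[name] is not None and int(parts[name]) > upper:
            raise ValueError(f"{name} out of range in {text!r}")
    return parts


print(parse_das_date("1997-07-16T19:20:30+01:00"))
```

Keeping the time zone designator as text leaves it to the caller to decide how to treat a missing designator or missing time of day.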
If the timezone designator is not specified a parser may assume 'Z'. If the seconds are not specified then a parser may assume 0. If the time is not specified then a parser may assume 12:00:00Z.

Here are some examples of valid dates

  1970-08-22
  2005-06-30T13:08
  1999-09-19T17:30Z
  1995-12-25T07:00-07:00
  1959-12-25T09:35+0300
  2000-01-01T01:23:45
  2009-04-15T23:02:31Z
  2001-10-22T21:39:12+01:00
  2042-03-18T01:19:00-11:15

==== SOURCES (detailed)

A sources request is a request for information about the data sets available from a DAS server. This may be a list of all data sources, a list of all versions of a given data source, or information about a specific version. All three are done by fetching a sources document given a URL. The returned format is identical for all three cases, except that some portions will require one element instead of a list of zero or more elements.

The sources request does not use query parameters. A future version of DAS may, so servers should respond with an HTTP error code ("400 Bad Request") if any are given.

The response document looks like the following:

Response:
  Content-Type: application/x-das-sources+xml

The MAINTAINER element is optional. A server should provide at least one of the 'name', 'email' or 'href' attributes. The 'name' is short human-readable text, the 'email' is an email address and the 'href' is a URL meant for a human using a web browser.

The SOURCES element has zero or more SOURCE elements. The 'id' is a URL. Each SOURCE must have a unique id. The SOURCE id should be fetchable, and a request on it should return a sources document describing the given data source. The 'title' is a short label describing the source to people. The optional 'writeable' attribute is either 'yes' or 'no'. The default is 'no'. If 'yes' then the server supports curational writeback. (XXX are we going to do it this way?) The optional 'doc_href' attribute is a URL to more detailed information about the source.
The optional 'taxid' is the NCBI taxon id for the species. (XXX isn't this redundant with the COORDINATES element?)

A SOURCE element has zero or more VERSION elements. The definition of what constitutes a new version is left to the data provider. The VERSION 'id' is a URL, which must be unique among the VERSION elements in a SOURCE. The VERSION id should be fetchable, and a request on it should return a sources document describing the specific version of the data source.

The optional 'title' attribute of a VERSION element is a short label describing the source to people. The required 'created' attribute states when the version was created and is an ISO timestamp. The optional 'modified' attribute states when the version was most recently modified. If the 'modified' attribute is not present then a client may assume it has the same value as 'created'.

Each VERSION element may contain an optional MAINTAINER element, which has the same syntax and meaning as the MAINTAINER element at the SOURCES level. It contains the contact information for the maintainer of the specific data source, which may be different from the maintainer of the server. If the VERSION MAINTAINER is not present, clients should use the SOURCES MAINTAINER instead for contact information.

The optional COORDINATES elements, if present, fully characterize the reference sequence. If two annotation servers have the same COORDINATES element, meaning the same 'authority' and 'source' values, then they are annotations on the same reference sequence.

The 'authority' attribute is the name of the organization that determined the coordinate system. It is a name like 'NCBI', 'EMBL', 'Ensembl', 'HUGO_ID', 'IPI' or 'UniProt'. The 'source' attribute refers to the "physical dimension" of the coordinate system. It is a name like 'Chromosome', 'Clone', 'Contig', 'Gene_ID', 'NT_Contig', 'Protein Sequence', 'Protein Structure', or 'Scaffold'. If the optional 'taxid' attribute is present it is the NCBI taxonomy id of the organism.
The Sanger Institute maintains a registry of authority and source values at http://XXX.

The COORDINATES tag contains an optional 'test_range' attribute used to test that the server is operational. Experience with DAS1 found that the web interface code often did not catch errors at the database interface layer and would return empty results instead of correctly reporting errors. The test_range attribute is a value that can be used in an 'inside' features filter. The response to that feature filter request must contain at least one feature.

There may be more than one COORDINATES element if ... (XXX why?)

The CAPABILITY elements describe what sort of queries a client may do with the versioned source data. The query is done through the URL listed in the 'query_id' field. Different query URLs support different query interfaces. The specific interface is listed in the 'type' field. The specification defines the following query URL types:

  'type' value    for queries on
  ------------    --------------
  segments        the sequence data for the largest contiguous components in the data source
  types           the feature types
  features        the features
  locks           the locks, for writeback (define here or in a sister "writeback spec"?)

A given type may not be used more than once. (XXX why not have more than one "segments"?)

Relative 'query_id's are resolved according to the current xml:base.

A CAPABILITY has zero or more FORMAT elements, each with a 'name' attribute. These list the supported formats for the given capability. To get the document in a given format, use the format's name in the "format" parameter of the query. This specification defines a standard set of format names. For details see the corresponding section. Clients and servers may support additional formats.

(XXX I earlier proposed a key/value table at the versioned element level. No one has used it or suggested it for anything. I now withdraw it.)

==== SEGMENTS (detailed)

Each reference sequence contains a set of segments.
A segment is the largest chunk of contiguous sequence available. For sequenced organisms each chromosome will be a segment. For partially assembled genomes where the distance between assembled ranges is not known, each partial fragment will have its own segment.

To get a list of all segments for a given data source, start with the versioned source record and find the element which has a "type" of "segments". The query_url attribute is a URL. Fetching that URL returns a document with format name "das2xml" and content-type "application/x-das-segments+xml".

Request:
  http://www.biodas.org/das2/sequence/volvox/v3/segments.xml

Response:
  Content-Type: application/x-das-segments+xml

There are zero or more elements under the root. Each segment has three attributes. The 'id' is the URL for the given segment. The 'name' attribute is a short word used in the feature query when specifying a segment. It must match the regular expression pattern /[a-zA-Z_][a-zA-Z0-9_]*/ . The 'length' attribute is an integer which is the total number of residues in the segment.

XXX need to have information here about the supported formats. See the DAS mailing list thread "format information for the reference server".

== Segments query parameters

The segments query URL and the URL for each segment support form-urlencoded query parameters.

The optional 'format' query specifies the data to return. The default is 'das2xml', which is the standard application/x-das-segments+xml document.

A DAS server must implement the "raw" and "fasta" format names. The raw format contains only the sequence data, with no record headers and with the text folded into lines of 78 characters or fewer. The content-type for raw sequence data should be "text/plain". If a partial or complete genomic assembly is available for a segment it can be retrieved by requesting "agp" format.

The FASTA records should have a title containing the segment name. A client may ignore the title.
The content-type of the FASTA response should be "text/x-fasta".

The individual segment URLs also support the 'range' parameter, which returns a portion of the sequence instead of the entire sequence. The range value is of the form "$start:$end". The start and end coordinates are non-negative integers as described earlier. The colon (":") must be escaped for use in a URL.

For example, the following retrieves all the sequence data for Chr2 in 'raw' format:

  http://www.biodas.org/h.sapiens/v22/Chr2?format=raw

and the following retrieves the 400 residues starting from the 501st position, in FASTA format:

  http://www.biodas.org/h.sapiens/v22/Chr2?range=500%3A900;format=fasta

If the server does not support a requested format name then it must respond with an HTTP error 400 "Bad Request". If the requested sequence is too large then it must respond with an HTTP error 413 "Request Entity Too Large". A server should be able to supply ranges of at least 1 megabase (XXX or smaller?). If the requested range extends beyond the segment limits then the server must respond with an HTTP error 400 "Bad Request". (XXX or send a response truncated to the available limits?)

=== FEATURES (detailed)

Each annotation server provides zero or more features. Information about the features is returned through the "features" CAPABILITY query URL, which returns a document with format name "das2xml" and content-type "application/x-das-features+xml".

Each feature has a unique identifier, which is a URL. A server should support direct fetching of a feature identifier. If supported, the default return document must be in das2xml format.

Here is an example of a DAS feature document

Request:
  http://www.biodas.org/das2/sequence/volvox/v3/feature/hit12

Response:
  Content-Type: application/x-das-features+xml

The FEATURE 'id' is the unique identifier for the given feature. It should be a fetchable URI. All features have a feature type.
The 'type_id' is the unique identifier for the type, as described in the types request. The 'name' (XXX Should that be "title" or "description") is used for .. what is it used for? The "created" and "modified" attributes contain ISO timestamps for when the feature annotation was first created and most recently modified.

The 'doc_href' links to external human readable documentation. XXX we said this should contain a description of the link. XXX Look at xlink?

The XID element says that ... XXX you know, I'm not sure what it does. Everything I can think of suggests there should be more attributes to the XID element. Eg, the type of relationship (is-a, has-a). Here's what the current spec says:

  Indicates that the feature corresponds to a DAS feature located on another data source (either local or remote). It is used to add annotations to a feature located on a remote server. (This type of functionality is sometimes called "gene DAS") A typical feature will either have a single tag or a single tag, although it is possible (and sensible) to have one or more of both. Note that even though a feature has an tag, it is known to the local server by its internal id attribute given in the tag.

A feature may be located on a segment, on multiple segments, or even on multiple regions of multiple segments. The position information is given by the LOC element. Each LOC contains the single attribute "segment" which is in one of three possible forms:

  $segment_name
  $segment_name/$start:$end
  $segment_name/$start:$end:$chain

For example,

  -- All of "ChrX"
  -- 105 residues of ChrX
  -- forward strand of those 105 residues
  -- reverse strand of those 105 residues

The segment name is the name given in the segments document. A client may use the segments document to map the name to a URL.

The biological annotation may be quite complex, and not directly represented as a single feature with a range. For example, a processed mRNA annotation may refer to two exon regions.
(XXX I am so not a biologist; need a better example.) In DAS these annotations are modeled through a parent/part relationship. A feature may have one or more parents and have one or more parts. (XXX why "part" and not "child"?)

The parent ids are available in the feature record through the 'id' attributes of the list of elements. The part ids are the 'id' attributes of the list of elements.

XXX fill in details here of various ways to model biological annotations through features. Develop a best-practices doc?

All feature locations are given in coordinates on a segment. Some features have alternate locations. For example, a feature may be located on a contig. Each alternate coordinate location is stored in a element, which has 'id' and 'range' attributes. The 'id' is the URL for the feature defining the alternate coordinate system. The 'range' is a string of the form "$start:$end". It defines the feature location in the coordinate system of the referred-to feature.

For example, suppose feature A is 6 bases long and is on chromosome 5 at position 10000, on exon X at position 300 and on contig K at position 7. The FEATURE record for this feature may be as follows:

GET http://www.biodas.org/das2/sequence/volvox/v3/feature/A
Content-Type: application/x-das-features+xml

A feature may have zero or more elements. Each describes an alignment of the given feature with another DAS feature, which may be on the same genome or a different one. The 'target_id' attribute is the URL for the aligned feature. Use the optional 'range' attribute if the alignment is to a part of the feature. The optional 'gap' describes gaps that are present in the reference and target strands of the alignment. The gap data is stored as a CIGAR-formatted string described in the Exonerate documentation.

  http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html

Here is an example alignment (XXX This has changed!)

A feature may have a