[DAS2] Sequence retrieval proposal
Steve Chervitz
Steve_Chervitz at affymetrix.com
Thu Dec 8 21:04:56 UTC 2005
On Thu, 8 Dec 2005, Thomas Down wrote:
>
> On 7 Dec 2005, at 23:22, Andrew Dalke wrote:
>>
>> Steve Chervitz wrote:
>>>
>>> 2. What do folks think about specifying a DAS2XML format for sequence
>>> requests (text/x-das-sequence+xml)? In addition to permitting an optional
>>> checksum attribute to address the above use case, it would add some
>>> consistency and flexibility to the spec, since at present, the default
>>> sequence response format is the only one that is not under our control
>>> (currently it's text/x-fasta).
>>
>> As a consumer of this sort of data, I don't want to write another
>> parser. It isn't just the parsing part - it's the effort of mapping
>> to my program's data model.
>>
>> There's already a huge number of existing sequence file formats.
>> What would another provide? Are some of them already extensible?
I am also somewhat loath to add yet another sequence file format to the
world. Seems reasonable to state that a DAS/2 server can supply sequence in
an alternative format via requests such as:
http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME
There would have to be a way for a server to indicate what alternative
formats it supports. We could use the same strategy as we do in the
versioned source request, supplying a FORMAT element listing alternative
formats. But where to put it? Perhaps in the regions request:
<REGIONS ...
<REGION id="sequence/ctg1" ...>
<FORMAT id="game" type="application/x-game+xml" />
<FORMAT id="otter" type="application/x-otter+xml" />
</REGION>
</REGIONS>
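To make the idea concrete, here is a rough Python sketch of how a client might discover the advertised formats and build the corresponding sequence request. It assumes the FORMAT-inside-REGION layout proposed above, which is not in the spec. The function names are made up for illustration, and the sample document fills in the elided attributes with nothing at all.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical regions response. The FORMAT children are only the proposal
# from this message, not something the DAS/2 spec currently defines.
REGIONS_XML = """\
<REGIONS>
  <REGION id="sequence/ctg1">
    <FORMAT id="game" type="application/x-game+xml" />
    <FORMAT id="otter" type="application/x-otter+xml" />
  </REGION>
</REGIONS>
"""


def supported_formats(regions_xml):
    """Map each REGION id to a dict of {format id: MIME type}."""
    root = ET.fromstring(regions_xml)
    return {
        region.get("id"): {
            fmt.get("id"): fmt.get("type") for fmt in region.findall("FORMAT")
        }
        for region in root.findall("REGION")
    }


def sequence_url(base_url, format_id=None):
    """Append ?format=<id> to a sequence URL; no id means the default format."""
    if format_id is None:
        return base_url
    return base_url + "?" + urlencode({"format": format_id})


if __name__ == "__main__":
    base = "http://www.wormbase.org/das/genome/volvox/1/sequence"
    for region_id, formats in supported_formats(REGIONS_XML).items():
        for format_id, mime_type in formats.items():
            print(region_id, mime_type, sequence_url(base, format_id))
```

Whether the format parameter value should match the FORMAT id exactly (including case) is something we would still need to decide.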
For interoperability purposes, we should provide a controlled vocabulary
of alternative formats and their types, at least for the commonly used ones.
>> Several of those formats are designed and developed by people involved
>> with DAS. If it's important, extend GAME or GFF.
>
> Do GAME or GFF have a sequence representation? I thought they were
> both primarily feature-table formats (right now I'm having trouble
> finding the GAME documentation though...).
Here's a brief tour of some possibly extensible candidates:
GFF - only represents features: http://song.sourceforge.net/gff3.shtml
GAME - does encode sequence data as a simple string.
Flybase/BDGP use GAME XML and appear to be the main users/maintainers. Suzi
and Chris can elaborate more here, but I found a link to an RNG schema in the
Apollo FAQ:
http://www.fruitfly.org/annot/apollo/game.rng.txt
GAME notes:
- The http://bioxml.org links are now obsolete. Here's an old description
containing such links: http://xml.coverpages.org/game.html
- GAME variants have arisen that have created incompatibilities in the bio*
world: http://open-bio.org/pipermail/bioperl-l/2003-April/011988.html
- When I checked a flybase data file, it didn't point to a DTD:
ftp://flybase.net/genomes/Drosophila_melanogaster/current/xml-game/
Otter - a sort of simplified GAME that also represents sequence:
http://www.sanger.ac.uk/Users/jgrg/otter_xml.html
XFF - models sequences and has alphabet support (Thomas: is this in use?):
http://www.biojava.org/thomasd/XFF/
INSDseq and EMBLxml - XML formats for GenBank/EMBL/DDBJ sequence data:
http://www.ebi.ac.uk/xembl/
BSML - Somewhat antiquated, but supported by the XEMBL service and in use
by LabBook:
http://www.bsml.org/
http://www.labbook.com/default.aspx
AGAVE - From DoubleTwist - now defunct, but also supported by XEMBL:
http://www.agavexml.org/
BIOML - Details are sketchy, appears to be used internally by Genomic
Solutions, which acquired Proteometrics, the originators of BIOML. Here are
the most recent references I could find:
http://www.biomedcentral.com/1471-2105/5/25
http://www.genomicsolutions.com/showPage.php?title=Data%20Integration
> The problem I have with Fasta format (other than the tendency of many
> data-providers to over-load the header line) is that there's no
> explicit marker for the alphabet and encoding of sequence data. This
> is pretty nasty for codebases like BioJava which want to present a
> richer view of sequence data than just a String. I'd certainly be in
> favour of a nice XML format that made alphabet information explicit.
> The DAS 1.5 DASSEQUENCE document has a moltype attribute which
> supports this (at least the three most important cases, DNA/RNA/
> Protein -- there's not a standards-compliant way to add other
> alphabets though).
Various data providers take all sorts of liberties with fasta sequence,
e.g., sequences with no IDs, whitespace-containing IDs, space between the
'>' and the ID, etc.
We might consider prescribing some conventions for what DAS considers proper
fasta format. I put in a little bit of description of a DAS-acceptable fasta
format here in the retrieval spec:
http://biodas.org/documents/das2/das2_get.html#sequence
Do we want to add more to this? Perhaps something about an optional
description being separated from the ID by whitespace and consisting of any
amount of free-form text.
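As a strawman, the conventions discussed here (an ID directly after the '>', no whitespace inside the ID, and an optional free-form description separated from the ID by whitespace) could be checked with something like the Python sketch below. This reflects only the rules mentioned in this message, not the wording of the retrieval spec, and the function name is made up.

```python
import re

# '>' immediately followed by a whitespace-free ID, then an optional
# description separated from the ID by spaces or tabs.
HEADER_RE = re.compile(r"^>(\S+)(?:[ \t]+(.*))?$")


def parse_header(line):
    """Return (seq_id, description) or raise ValueError for a bad header."""
    match = HEADER_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not an acceptable fasta header: {line!r}")
    seq_id, description = match.group(1), match.group(2)
    return seq_id, description or ""


if __name__ == "__main__":
    for header in (">ctg1 volvox contig 1", ">ctg1", "> ctg1", ">"):
        try:
            print(repr(header), "->", parse_header(header))
        except ValueError as err:
            print(repr(header), "-> rejected:", err)
```

A header with a space between the '>' and the ID, or with no ID at all, is rejected. An ID containing whitespace cannot be detected this way, since everything after the first whitespace is taken as the description.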
Steve
> I guess an alternative, more classically RESTful, way of doing things
> might be with MIME types:
>
> Content-Type: application/fasta; sequence-alphabet=DNA;
> sequence-encoding=IUPAC
>
> I admit I'd prefer the XML though...
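For what it's worth, the MIME-parameter approach would be cheap for clients to consume. Here is a small Python sketch using the standard library's email package to pull the parameters out of a header like the one Thomas shows. The parameter names sequence-alphabet and sequence-encoding are just the ones from his example, not anything registered, and the function name is made up.

```python
from email.message import Message


def sequence_type(content_type):
    """Split a Content-Type value into (MIME type, alphabet, encoding).

    Parameters that are absent come back as None.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return (
        msg.get_content_type(),
        msg.get_param("sequence-alphabet"),
        msg.get_param("sequence-encoding"),
    )


if __name__ == "__main__":
    header = "application/fasta; sequence-alphabet=DNA; sequence-encoding=IUPAC"
    print(sequence_type(header))
    # A plain type without parameters leaves alphabet and encoding unset.
    print(sequence_type("text/x-fasta"))
```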
>
>
> Thomas.
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2