[DAS2] Sequence retrieval proposal

Thu Dec 8 21:04:56 UTC 2005

On Thu, 8 Dec 2005, Thomas Down wrote:
> 
> On 7 Dec 2005, at 23:22, Andrew Dalke wrote:
>>
>> Steve Chervitz wrote:
>>>
>>> 2. What do folks think about specifying a DAS2XML format for sequence
>>> requests (text/x-das-sequence+xml)? In addition to permitting an optional
>>> checksum attribute to address the above use case, it would add some
>>> consistency and flexibility to the spec, since at present, the default
>>> sequence response format is the only one that is not under our control
>>> (currently it's text/x-fasta).
>> 
>> As a consumer of this sort of data, I don't want to write another
>> parser.  It isn't just the parsing part - it's the effort of mapping
>> to my program's data model.
>> 
>> There's already a huge number of existing sequence file formats.
>> What would another provide?  Are some of them already extensible?

I am also somewhat loath to add yet another sequence file format to the
world. Seems reasonable to state that a DAS/2 server can supply sequence in
an alternative format via requests such as:

  http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME

There would have to be a way for a server to indicate what alternative
formats it supports. We could use the same strategy as we do in the
versioned source request, supplying a FORMAT element listing alternative
formats. But where to put it? Perhaps in the regions request:

<REGIONS ...>
   <REGION id="sequence/ctg1" ...>
     <FORMAT id="game"  type="application/x-game+xml" />
     <FORMAT id="otter" type="application/x-otter+xml" />
   </REGION>
</REGIONS>

For interoperability purposes, we should provide a controlled vocabulary
of alternative formats and their types, at least for the commonly used ones.
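
To make the idea concrete, here is a minimal Python sketch of how a client
might read the FORMAT elements from a regions response and build the
corresponding sequence request. The element and attribute names come from the
fragment above; the sample document, the function names, and the exact
spelling of the format value in the query string are assumptions for
illustration, not part of the spec.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical regions response, modelled on the fragment above with the
# elided attributes left out so that it parses as well-formed XML.
SAMPLE_REGIONS = """<REGIONS>
  <REGION id="sequence/ctg1">
    <FORMAT id="game" type="application/x-game+xml" />
    <FORMAT id="otter" type="application/x-otter+xml" />
  </REGION>
</REGIONS>"""


def formats_by_region(xml_text):
    """Map each REGION id to a {format id: MIME type} dictionary."""
    root = ET.fromstring(xml_text)
    result = {}
    for region in root.iter("REGION"):
        result[region.get("id")] = {
            fmt.get("id"): fmt.get("type") for fmt in region.findall("FORMAT")
        }
    return result


def sequence_url(base_url, format_id=None):
    """Append ?format=... to a sequence URL; no parameter means the default."""
    if format_id is None:
        return base_url
    return base_url + "?" + urlencode({"format": format_id})


if __name__ == "__main__":
    print(formats_by_region(SAMPLE_REGIONS))
    print(sequence_url(
        "http://www.wormbase.org/das/genome/volvox/1/sequence", "game"))
```

A client would only request a format whose id the server listed for that
region, and fall back to the default response otherwise.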

>> Several of those formats are designed and developed by people involved
>> with DAS.  If it's important, extend GAME or GFF.
> 
> Do GAME or GFF have a sequence representation?  I thought they were
> both primarily feature-table formats (right now I'm having trouble
> finding the GAME documentation though...).

Here's a brief tour of some possibly extensible candidates:

GFF - only represents features: http://song.sourceforge.net/gff3.shtml

GAME - does encode sequence data as a simple string.
Flybase/BDGP use GAME XML and appear to be the main users/maintainers. Suzi
and Chris can elaborate more here, but I found a link to an RNG schema in the
Apollo FAQ: 
http://www.fruitfly.org/annot/apollo/game.rng.txt

GAME notes:
- The http://bioxml.org links are now obsolete. Here's an old description
containing such links: http://xml.coverpages.org/game.html
- GAME variants have arisen that have created incompatibilities in the bio*
world: http://open-bio.org/pipermail/bioperl-l/2003-April/011988.html
- When I checked a flybase data file, it didn't point to a DTD:
ftp://flybase.net/genomes/Drosophila_melanogaster/current/xml-game/

Otter - a sort of simplified GAME that also represents sequence:
http://www.sanger.ac.uk/Users/jgrg/otter_xml.html

XFF - models sequences and has alphabet support (Thomas: is this in use?):
http://www.biojava.org/thomasd/XFF/

INSDseq and EMBLxml - XML formats for GenBank/EMBL/DDBJ sequence data:
http://www.ebi.ac.uk/xembl/

BSML - Somewhat antiquated, but supported by the XEMBL service and in use by
LabBook:
http://www.bsml.org/
http://www.labbook.com/default.aspx

AGAVE - From DoubleTwist - now defunct, but also supported by XEMBL:
http://www.agavexml.org/

BIOML - Details are sketchy; it appears to be used internally by Genomic
Solutions, which acquired Proteometrics, the originators of BIOML. Here are
the most recent references I could find:
http://www.biomedcentral.com/1471-2105/5/25
http://www.genomicsolutions.com/showPage.php?title=Data%20Integration

> The problem I have with Fasta format (other than the tendency of many
> data-providers to over-load the header line) is that there's no
> explicit marker for the alphabet and encoding of sequence data.  This
> is pretty nasty for codebases like BioJava which want to present a
> richer view of sequence data than just a String.  I'd certainly be in
> favour of a nice XML format that made alphabet information explicit.
> The DAS 1.5 DASSEQUENCE document has a moltype attribute which
> supports this (at least the three most important cases, DNA/RNA/
> Protein -- there's not a standards-compliant way to add other
> alphabets though).

Various data providers take all sorts of liberties with fasta sequence,
e.g., sequences with no IDs, whitespace-containing IDs, space between the
'>' and the ID, etc.

We might consider prescribing some conventions for what DAS considers proper
fasta format. I put in a little bit of description of a DAS-acceptable fasta
format here in the retrieval spec:
http://biodas.org/documents/das2/das2_get.html#sequence

Do we want to add more to this? Perhaps something about an optional
description being separated from the ID by whitespace and consisting of any
amount of free-form text.
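
As a rough illustration of what such a convention could look like to a
client, here is a small Python sketch of a strict header-line parser. It
assumes one possible reading of the points above: the ID follows the '>'
immediately, contains no whitespace, and any description is separated from
it by spaces or tabs. The function name and the exact rules are my
assumptions, not text from the retrieval spec.

```python
import re

# '>' then a non-empty ID with no whitespace, then an optional
# whitespace-separated free-form description.
HEADER_RE = re.compile(r"^>(\S+)(?:[ \t]+(.*))?$")


def parse_header(line):
    """Return (seq_id, description or None) or raise ValueError."""
    match = HEADER_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not an acceptable fasta header: %r" % line)
    seq_id, description = match.group(1), match.group(2)
    # Treat an empty or all-blank description as absent.
    description = description.strip() if description else None
    return seq_id, description or None


if __name__ == "__main__":
    for header in (">ctg1 Volvox contig 1", ">ctg1", "> ctg1", ">"):
        try:
            print(parse_header(header))
        except ValueError as err:
            print("rejected:", err)
```

Under these rules the first two headers are accepted, while the ones with a
space after the '>' or with no ID at all are rejected.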

Steve

> I guess an alternative, more classically RESTful, way of doing things
> might be with MIME types:
> 
>         Content-Type: application/fasta; sequence-alphabet=DNA;
> sequence-encoding=IUPAC
> 
> I admit I'd prefer the XML though...
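
For comparison, the MIME-parameter approach is easy for a client to consume.
The sketch below uses Python's standard email.message.Message to split a
Content-Type value into its media type and parameters. The parameter names
sequence-alphabet and sequence-encoding are taken from Thomas's example; they
are a suggestion in this thread, not registered parameters, and the function
name is hypothetical.

```python
from email.message import Message


def parse_sequence_content_type(header_value):
    """Split a Content-Type value into media type, alphabet and encoding."""
    msg = Message()
    msg["Content-Type"] = header_value
    return {
        "media_type": msg.get_content_type(),
        # get_param returns None when the parameter is absent.
        "alphabet": msg.get_param("sequence-alphabet"),
        "encoding": msg.get_param("sequence-encoding"),
    }


if __name__ == "__main__":
    print(parse_sequence_content_type(
        "application/fasta; sequence-alphabet=DNA; sequence-encoding=IUPAC"))
```

A response that carries no such parameters simply yields None for both, so a
client can still fall back to guessing the alphabet from the sequence itself.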
> 
> 
>              Thomas.