[DAS2] Sequence retrieval proposal
Andrew Dalke
dalke at dalkescientific.com
Sun Dec 11 13:40:46 EST 2005
Thomas:
> Do GAME or GFF have a sequence representation? I thought they were
> both primarily feature-table formats (right now I'm having trouble
> finding the GAME documentation though...).
Others followed up on this.
For me, I was confused. Even though Steve said "sequence retrieval" --
in the subject even -- I was thinking of feature formats.
I think that came to mind because I expect there to be more feature
data transferred than sequence data, so if data corruption is a concern
then the annotations are more likely to have problems.
Or I may have been thinking about some of the formats (GenBank,
Swiss-Prot) which combine the two and have a checksum.
I still don't think checksum-identifiable data corruption is something
we need to worry about.
> The problem I have with Fasta format (other than the tendency of many
> data-providers to over-load the header line) is that there's no
> explicit marker for the alphabet and encoding of sequence data.
*sigh* It seems like this never goes away. Biopython also has a "rich"
alphabet property, designed to handle alternate alphabets, like
3-letter codes and secondary structure alphabets. Bioperl's approach
seems more appropriate in practice - dna, protein, rna, and perhaps
'unknown'.
In the context of DAS, this is not a problem. DAS 2.0 uses only genomic
data, so all FASTA records will be of type 'dna'.
It might be different with structure data where a single record may
have all three alphabet types. (Though I only know of structures with
2 of the 3.)
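
To make the "all FASTA records are dna" assumption concrete, here is a
minimal Python sketch of how a client might check it. The names
read_fasta, is_iupac_dna and IUPAC_DNA are made up for illustration and
are not part of DAS or any library, and the letter set assumes the
standard IUPAC nucleotide codes with no gap characters.

```python
# Hypothetical helpers for illustration only; not part of DAS or Biopython.

# Assumed alphabet: the IUPAC nucleotide codes, without gap symbols.
IUPAC_DNA = frozenset("ACGTRYSWKMBDHVN")


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_iupac_dna(sequence):
    """Return True if every letter is an IUPAC nucleotide code."""
    return set(sequence.upper()) <= IUPAC_DNA
```

Note that a check like this can only reject a record. Many IUPAC
nucleotide letters are also valid amino acid codes, so a short protein
sequence can pass, which is part of why the alphabet cannot be reliably
inferred from the sequence alone.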
> I guess an alternative, more classically RESTful, way of doing things
> might be with MIME types:
>
> Content-Type: application/fasta; sequence-alphabet=DNA;
> sequence-encoding=IUPAC
>
> I admit I'd prefer the XML though...
As I mentioned, for purposes of DAS 2.0 this isn't needed so I
don't think we need to solve this problem.
If we do, I think it's a nearly intractable problem. How does one
register all the different possible alphabets? IUPAC dna/rna/protein
covers most of it. Getting the other few percent is hard. Then
making all the software preserve or interconvert the different
formats adds another layer of hard. There are a lot of social issues
as well.
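
For reference, reading the MIME parameters from Thomas's example is the
easy part. Below is a rough sketch using Python's standard email
package. The function name parse_sequence_content_type is made up, and
"application/fasta" with its sequence-alphabet and sequence-encoding
parameters is only the hypothetical header from the quoted suggestion,
not a registered type.

```python
from email.message import Message


def parse_sequence_content_type(header_value):
    """Split a Content-Type value into its type and sequence parameters.

    The parameter names come from the quoted suggestion and are
    hypothetical; a missing parameter is returned as None.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return {
        "type": msg.get_content_type(),
        "alphabet": msg.get_param("sequence-alphabet"),
        "encoding": msg.get_param("sequence-encoding"),
    }


info = parse_sequence_content_type(
    "application/fasta; sequence-alphabet=DNA; sequence-encoding=IUPAC"
)
```

The parsing is mechanical. The hard part is still agreeing on what
values like "DNA" and "IUPAC" are allowed to be.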
Andrew
dalke at dalkescientific.com