[Open-bio-l] SeqXML an alternative for FASTA

Tue Jul 5 10:57:37 EDT 2011

Hi,

Thanks for the feedback!

On Jul 4, 2011, at 1:41 PM, Peter Cock wrote:

> On Fri, Jul 1, 2011 at 8:57 AM, Thomas Schmitt <Thomas.Schmitt at sbc.su.se> wrote:
>> Hello everybody,
>> 
>> We recently published a new XML format called SeqXML to store biological
>> sequences. Our aim was to create a lightweight alternative to FASTA that
>> allows storing the metadata that is typically squeezed into a FASTA header
>> in a standardized way.
>> 
>> It looks something like this:
>> 
>> <seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
>>    <entry id="ENST00000308775" >
>>        <description>dystroglycan 1</description>
>>        <RNAseq>AAGGCGAUGUC.....ACAU</RNAseq>
>>        <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
>>        <property name="prediction_method" value="manual curation"/>
>>    </entry>
>>    <entry id="ENSP00000312435" >
>>        <AAseq>AAGGCGAAA...CACJOXA</AAseq>
>>    </entry>
>> </seqXML>
>> 
>> Check out the paper at http://bib.oxfordjournals.org/content/early/2011/06/10/bib.bbr025.full?keytype=ref&ijkey=dWzLPFBuzrdZme8
>> 
>> There is also a website (http://seqxml.org) where you can find the schema and a
>> detailed documentation. The whole thing emerged from developing formats for the
>> orthology community so you will also find information about our orthology format
>> OrthoXML at these resources.
>> 
>> 
>> To my knowledge the only format comparable to SeqXML is TinySeq, which has
>> some significant limitations:
>> 
>> - It doesn't support database cross referencing
>> - The identifiers are more NCBI specific
>> - It is more verbose
>> - There is only a very primitive DTD
>> - It doesn't allow validating the sequence alphabet
>> - It isn't possible to define the source of the sequences
>> - It doesn't support key value pair annotations
>> 
> 
> Thanks for the comparison to TinySeq. Did you find a good introductory
> document for this file format?

Not really. The only things I found were the DTD, a very general document, and some examples:

http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.mod.dtd
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/ncbixml.txt
http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&qty=3&c_start=1&list_uids=D12625,D42072,M82814&uids=&dopt=tinyseq&dispmax=5&sendto=

>> 
>> We are trying to get IO implementations for SeqXML for all Bio* projects.
>> 
> 
> That would definitely help with getting people using the format.
> 
>> 
>> There is already an implementation in BioPerl maintained by Dave Messina.
>> We do have an implementation for the legacy version of BioJava and Andrew
>> Yates promised to help us migrate it into BioJava 3.
> 
> That sounds promising.
> 
>> I'm also in contact with Peter Cock about a Biopython integration. He in
>> fact asked me to move the discussion to this list.
> 
> :)
> 
> Note we're using the format name "seqxml" in Biopython's SeqIO to match
> what was used in BioPerl's SeqIO.
> 
>> 
>> What do you guys think about the format?
>> 
> 
> I'm wondering about the predefined allowed character sets for DNA, RNA
> and Protein, and if they are overly prescriptive for some special use-cases.
> Extra symbols are sometimes included for things like frame shifts, or to
> indicate different stop codons.
> 
> Related to this, what about things like modified RNA (a vast alphabet),
> or color space (used in the ABI Solid Sequencing platform)?
> The simple answer is these are out of scope ;)

Right now SeqXML supports three different alphabets. These cover the basic use cases and shouldn't be changed.
But one can easily add more alphabets for special purposes in the form of additional sequence types.
Apart from the ones mentioned above, quality values and RNA secondary structures come to mind.
Because the sequence type is not defined at the entry level, adding new types is backwards compatible.
With these different sequence types, one might also want to allow more than one sequence per entry.
I do, however, think we should be careful about adding new features. We don't want to cover every possible use case
and end up with a format monster. Our goal was to create a simple format that fulfills the typical needs that FASTA is used for.
What remains to be settled is what counts as typical.
Another issue I see is API support. Do all Bio* APIs support such special alphabets?
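
To illustrate the backwards-compatibility point, here is a rough sketch (not the
BioPerl, BioJava or Biopython code) of a reader that picks the sequence type from
the name of the sequence element and simply skips elements it does not know.
RNAseq and AAseq are taken from the example quoted above; DNAseq is assumed by
analogy, and QUALseq is a made-up element standing in for a future sequence type.

```python
import xml.etree.ElementTree as ET

# RNAseq and AAseq appear in the example quoted in this thread.
# DNAseq is an assumption made by analogy with those two names.
KNOWN_SEQ_TAGS = {"DNAseq": "DNA", "RNAseq": "RNA", "AAseq": "protein"}

# QUALseq is hypothetical: it stands in for a sequence type added later.
EXAMPLE = """<seqXML source="Ensembl">
  <entry id="e1">
    <RNAseq>AAGGCGAUGUC</RNAseq>
    <QUALseq>30 30 28</QUALseq>
  </entry>
  <entry id="e2">
    <AAseq>MKVLA</AAseq>
  </entry>
</seqXML>"""


def read_entries(xml_text):
    """Return (entry id, sequence type, sequence) tuples for known sequence elements."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter("entry"):
        for child in entry:
            seq_type = KNOWN_SEQ_TAGS.get(child.tag)
            if seq_type is None:
                # Unknown element (description, property, or a newer
                # sequence type): an older reader just ignores it.
                continue
            records.append((entry.get("id"), seq_type, (child.text or "").strip()))
    return records


if __name__ == "__main__":
    for record in read_entries(EXAMPLE):
        print(record)
```

An older reader written this way keeps working on files that contain newer
sequence elements; it just does not see them.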

> 
> However, the main missing feature for me is a feature table as in the
> GenBank, GenPept, EMBL, SwissProt etc flat files, and also represented
> in some way in their XML equivalents:
> 
> http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.mod.dtd
> http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.mod.dtd
> (I haven't found the details of the feature tables/sets yet)
> 
> http://www.uniprot.org/docs/uniprot.xsd
> http://www.uniprot.org/docs/xml_news.htm
> (Biopython already has a parser for the UniProt XML format, including
> the features.)
> 
> Clearly there is overlap here with GFF3 as well - so this is a potential
> mine field of compatibility issues. Again, the simple answer is features
> are out of scope.

SeqXML supports simple features in the form of key-value pairs. Rich, position-specific feature tables
belong in full-blown record formats like the ones you mentioned, which we are clearly not trying to create.
So in short, I would say out of scope.
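
As a small sketch of what the key-value pairs give you, the snippet below reads the
property elements of one entry into a dictionary using only the Python standard
library. The entry is cut down from the example quoted above; the function name
entry_properties is made up for this illustration and is not part of any Bio* API.

```python
import xml.etree.ElementTree as ET

# A single entry, reduced from the example quoted earlier in this thread.
ENTRY = """<entry id="ENST00000308775">
  <description>dystroglycan 1</description>
  <property name="prediction_method" value="manual curation"/>
</entry>"""


def entry_properties(entry_xml):
    """Map each property name of an entry to its value (hypothetical helper)."""
    entry = ET.fromstring(entry_xml)
    return {prop.get("name"): prop.get("value") for prop in entry.findall("property")}


if __name__ == "__main__":
    print(entry_properties(ENTRY))
```

Anything that needs coordinates on the sequence does not fit this flat name/value
shape, which is why feature tables are left to the full record formats.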

> 
>> 
>> Is there anybody who wants to contribute with a BioRuby implementation?
>> 
>> Best regards,
>> Thomas
> 
> I've also CC'd Peter Rice to ask if SeqXML is something EMBOSS would
> consider supporting?
> 
> Regards,
> 
> Peter

Cheers,
Thomas