[Open-bio-l] SeqXML an alternative for FASTA

Mon Jul 4 07:41:09 EDT 2011

On Fri, Jul 1, 2011 at 8:57 AM, Thomas Schmitt <Thomas.Schmitt at sbc.su.se> wrote:
> Hello everybody,
>
> We recently published a new XML format called SeqXML to store biological
> sequences. Our aim was to create a lightweight alternative to FASTA that
> allows storing the metadata that is typically squeezed into a FASTA header
> in a standardized way.
>
> It looks something like this:
>
> <seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
>    <entry id="ENST00000308775" >
>        <description>dystroglycan 1</description>
>        <RNAseq>AAGGCGAUGUC.....ACAU</RNAseq>
>        <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
>        <property name="prediction_method" value="manual curation"/>
>    </entry>
>    <entry id="ENSP00000312435" >
>        <AAseq>AAGGCGAAA...CACJOXA</AAseq>
>    </entry>
> </seqXML>
>
> Check out the paper at http://bib.oxfordjournals.org/content/early/2011/06/10/bib.bbr025.full?keytype=ref&ijkey=dWzLPFBuzrdZme8
>
> There is also a website (http://seqxml.org) where you can find the schema and
> detailed documentation. The whole thing emerged from developing formats for the
> orthology community so you will also find information about our orthology format
> OrthoXML at these resources.
>
>
> To my knowledge the only format comparable to SeqXML is TinySeq, which
> has some significant limitations:
>
> - It doesn't support database cross referencing
> - The identifiers are more NCBI specific
> - It is more verbose
> - There is only a very primitive DTD
> - It doesn't allow validating the sequence alphabet
> - It isn't possible to define the source of the sequences
> - It doesn't support key-value pair annotations
>
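
To make the points above about cross-references and key-value annotations
concrete, here is a minimal sketch of reading the quoted example with only
the Python standard library. It is not the Biopython or BioPerl parser. The
sequences are short stand-ins for the elided ones in the original message,
the DNAseq tag is assumed by analogy with RNAseq and AAseq, and read_entries
is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Adapted from the quoted example. The sequences are short stand-ins,
# not the real (elided) sequences from the original message.
SEQXML = """\
<seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
  <entry id="ENST00000308775">
    <description>dystroglycan 1</description>
    <RNAseq>AAGGCGAUGUC</RNAseq>
    <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
    <property name="prediction_method" value="manual curation"/>
  </entry>
  <entry id="ENSP00000312435">
    <AAseq>AAGGCGAAA</AAseq>
  </entry>
</seqXML>
"""

# RNAseq and AAseq appear in the example; DNAseq is an assumption.
SEQ_TAGS = ("DNAseq", "RNAseq", "AAseq")


def read_entries(text):
    """Return one dict per <entry>, with its cross-references and properties."""
    root = ET.fromstring(text)
    entries = []
    for entry in root.findall("entry"):
        # The sequence element is whichever child carries a known sequence tag.
        seq_elem = next((c for c in entry if c.tag in SEQ_TAGS), None)
        entries.append({
            "id": entry.get("id"),
            "source": root.get("source"),  # file-level source from the root
            "description": entry.findtext("description"),
            "seq_type": seq_elem.tag if seq_elem is not None else None,
            "sequence": seq_elem.text if seq_elem is not None else None,
            "dbrefs": [(r.get("source"), r.get("id"))
                       for r in entry.findall("DBRef")],
            "properties": {p.get("name"): p.get("value")
                           for p in entry.findall("property")},
        })
    return entries


entries = read_entries(SEQXML)
for e in entries:
    print(e["id"], e["seq_type"], e["dbrefs"], e["properties"])
```

The sketch reads the source only from the root element, as in the example;
whether individual entries can override it is something to check against
the schema.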

Thanks for the comparison to TinySeq. Did you find a good introductory
document for this file format?

>
> We are trying to get IO implementations for SeqXML for all Bio* projects.
>

That would definitely help with getting people using the format.

>
> There is already an implementation in BioPerl maintained by Dave Messina.
> We have an implementation for the legacy version of BioJava, and Andrew
> Yates promised to help us migrate it to BioJava 3.

That sounds promising.

> I'm also in contact with Peter Cock about a Biopython integration. He in
> fact asked me to move the discussion to this list.

:)

Note we're using the format name "seqxml" in Biopython's SeqIO to match
what was used in BioPerl's SeqIO.

>
> What do you guys think about the format?
>

I'm wondering about the predefined allowed character sets for DNA, RNA
and Protein, and if they are overly prescriptive for some special use-cases.
Extra symbols are sometimes included for things like frame shifts, or to
indicate different stop codons.
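
As a rough illustration of the concern, here is a sketch of a strict
alphabet check. The alphabets below are assumptions for the example (IUPAC
nucleotide codes, and the standard amino-acid letters plus B, J, O, U, X
and Z), not the patterns from the SeqXML schema, and invalid_symbols is a
hypothetical helper. The "*" stands in for the kind of extra symbol
sometimes used to mark a stop codon in a translation.

```python
# Hypothetical alphabets for illustration only; the SeqXML schema defines
# its own allowed character sets, which may differ from these.
ALPHABETS = {
    "DNA": "ACGTRYSWKMBDHVN",
    "RNA": "ACGURYSWKMBDHVN",
    "protein": "ACDEFGHIKLMNPQRSTVWYBJOUXZ",
}


def invalid_symbols(sequence, kind):
    """Return the sorted list of symbols not allowed for this sequence kind."""
    allowed = set(ALPHABETS[kind])
    return sorted(set(sequence.upper()) - allowed)


print(invalid_symbols("ATGGCCNNRTAA", "DNA"))  # [] (all symbols allowed)
print(invalid_symbols("MKT*LLV", "protein"))   # ['*'] (stop symbol rejected)
```

A validating parser with a fixed alphabet would reject the second sequence
outright, which is the over-prescription I am worried about.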

Related to this, what about things like modified RNA (a vast alphabet),
or color space (used in the ABI SOLiD sequencing platform)?
The simple answer is these are out of scope ;)

However, the main missing feature for me is a feature table as in the
GenBank, GenPept, EMBL, SwissProt etc flat files, and also represented
in some way in their XML equivalents:

http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.mod.dtd
http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.mod.dtd
(I haven't found the details of the feature tables/sets yet)

http://www.uniprot.org/docs/uniprot.xsd
http://www.uniprot.org/docs/xml_news.htm
(Biopython already has a parser for the UniProt XML format, including
the features.)

Clearly there is overlap here with GFF3 as well - so this is a potential
minefield of compatibility issues. Again, the simple answer is features
are out of scope.

>
> Is there anybody who wants to contribute with a BioRuby implementation?
>
> Best regards,
> Thomas

I've also CC'd Peter Rice to ask if SeqXML is something EMBOSS would
consider supporting.

Regards,

Peter