Bioperl: XML

Vicki Brown vlb@deltagen.com
Thu, 6 May 1999 10:43:06 -0700


The BioPerl list hasn't mentioned XML since January... The message below
was forwarded to me.  What is the current view/status in the BioPerl
community as regards XML?  There was talk of a BoulderIO <-> XML converter
as well as a CGI <-> XML converter.

I can't agree with the assertion that XML will result in
"(No more perl-parsers for BLAST-output!!)", but I thought this was worth
bringing up on the BioPerl list.

With the permission of Mr. Loeffler:

-----Original Message-----
>From: Gerald Loeffler <Gerald.Loeffler@vienna.at>
>To: Computational Chemistry Mailing List <chemistry@infomeister.osc.edu>
>Date: Friday, April 30, 1999 4:00 AM
>Subject: CCL:XML for Bioinformatics Data

>Hi!
>
>Recently, I've been working a lot with XML (see http://www.w3c.org/xml/
>and e.g. http://www.ibm.com/xml/), which is a standard, human-readable,
>extensible markup-language that is rapidly becoming _the_ method of
>choice for exchange and storage of any kind of data and documents. It
>seems to me that XML would simply be _perfect_ for data exchange and
>maybe even data storage in bioinformatics (see end of message for a note
>on chemistry and CML).
>
>E.g. (off the top of my head), a DNA/protein sequence similarity search
>engine (e.g. NCBI's BLAST server) might return its search results in the
>form of an XML document that could look like this:
>
><seq-sim-search-results>
>  <query>
>    <type>                         protein     </type>
>    <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
>    <algorithm>                    FASTA3      </algorithm>
>    <db>                           SwissProt   </db>
>    <gap-open>                    -12          </gap-open>
>    <gap-extension>               -2           </gap-extension>
>  </query>
>  <hits>
>    <hit>
>      <accession>       HPS_HUMAN    </accession>
>      <organism>        homo sapiens </organism>
>      <overlap>         11           </overlap>
>      <overlapping-seq> GAEVLFYWTDQ  </overlapping-seq>
>      <z-score>         129.3        </z-score>
>    </hit>
>    <hit>
>      <accession>       PA24_MOUSE   </accession>
>      <organism>        mus musculus </organism>
>      <overlap>         8            </overlap>
>      <overlapping-seq> VFIFYWTT     </overlapping-seq>
>      <z-score>         133.3        </z-score>
>    </hit>
>  </hits>
></seq-sim-search-results>
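
A minimal sketch of how a program might read a document like the one
above. It uses Python's standard xml.etree.ElementTree module rather than
any BLAST- or BioPerl-specific tool. The function name parse_hits and the
trimmed SAMPLE text are illustrative assumptions, not part of the
proposal.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example document (assumption: only the fields used
# below are kept). The whitespace padding around values is kept on purpose.
SAMPLE = """\
<seq-sim-search-results>
  <query>
    <type>      protein   </type>
    <algorithm> FASTA3    </algorithm>
    <db>        SwissProt </db>
  </query>
  <hits>
    <hit>
      <accession> HPS_HUMAN    </accession>
      <organism>  homo sapiens </organism>
      <overlap>   11           </overlap>
      <z-score>   129.3        </z-score>
    </hit>
    <hit>
      <accession> PA24_MOUSE   </accession>
      <organism>  mus musculus </organism>
      <overlap>   8            </overlap>
      <z-score>   133.3        </z-score>
    </hit>
  </hits>
</seq-sim-search-results>
"""


def parse_hits(xml_text):
    """Return one dict per <hit>, with padding stripped and numbers converted."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.findall("./hits/hit"):
        hits.append({
            "accession": hit.findtext("accession").strip(),
            "organism": hit.findtext("organism").strip(),
            # int() and float() tolerate the surrounding whitespace.
            "overlap": int(hit.findtext("overlap")),
            "z_score": float(hit.findtext("z-score")),
        })
    return hits


if __name__ == "__main__":
    for h in parse_hits(SAMPLE):
        print(h["accession"], h["organism"], h["overlap"], h["z_score"])
```

Note that the generic parser only recovers the structure. Stripping the
padding and deciding that an overlap is an integer and a z-score is a
float is still left to the program.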
>
>There are several important points here:
>
>1) Without knowing what this XML document is about, a program can assert
>that it is well-formed! These programs exist, are free and are
>applicable to all XML documents!
>
>2) The rules for the nesting and naming of the tags in XML documents of
>this type can be formally defined in XML. The above document would be of
>type "seq-sim-search-results" and you could easily write a formal
>definition (in a DTD file) that says that such a document must contain a
>"query" and a "hits" tag; the "query" tag in turn must contain exactly
>one of each "type", "seq", ... The "hits" tag in turn may contain 0 or
>more "hit" tags which in turn ...
>
>3) Having a formal definition of documents of this type, a program can
>verify that our above XML document complies with the formal definition
>(is valid). These programs exist, are free and are applicable to all XML
>documents!
>
>4) Free utilities exist (e.g. IBM's xml4j) that can programmatically
>write and read (parse) any XML document and thus give a program access
>to the structure and content of the document!! (No more perl-parsers for
>BLAST-output!!)
>
>5) This file is human-readable! (in contrast to a Corba struct or a
>serialized Java object!)
>
>6) Modern WWW-browsers can (if a style-sheet is supplied) directly
>display this XML document. For old browsers, the XML document can easily
>be converted to HTML for display.
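
Points 1 to 3 can be sketched roughly as follows. This is not DTD
validation: Python's standard library checks well-formedness but does not
validate against a DTD, so the structural rules from point 2 are checked
by hand here as a stand-in. The function names and the
REQUIRED_QUERY_CHILDREN tuple are hypothetical, and only the two child
elements named explicitly in point 2 are required, since the rest of that
list is elided in the message.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text parses as XML at all; says nothing about its meaning."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Assumption: only the children named explicitly in point 2 are required.
REQUIRED_QUERY_CHILDREN = ("type", "seq")


def structure_problems(xml_text):
    """Return a list of rule violations; an empty list means the rules hold."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "seq-sim-search-results":
        problems.append("unexpected root element: " + root.tag)
    queries = root.findall("query")
    if len(queries) != 1:
        problems.append("expected exactly one <query>")
    for query in queries:
        for name in REQUIRED_QUERY_CHILDREN:
            if len(query.findall(name)) != 1:
                problems.append("<query> needs exactly one <%s>" % name)
    hits_blocks = root.findall("hits")
    if len(hits_blocks) != 1:
        problems.append("expected exactly one <hits>")
    for hits in hits_blocks:
        # Zero or more <hit> children are fine; anything else is flagged.
        for child in hits:
            if child.tag != "hit":
                problems.append("unexpected <%s> inside <hits>" % child.tag)
    return problems


if __name__ == "__main__":
    doc = ("<seq-sim-search-results><query><type>protein</type>"
           "<seq>GAVLIFYWSTQ</seq></query><hits/></seq-sim-search-results>")
    print(is_well_formed(doc), structure_problems(doc))
```

The first function works on any XML document without knowing what it is
about, while the second is specific to this document type. A validating
parser driven by a DTD would replace the hand-written rules.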
>
>I think you get the idea.
>
>Does such an XML-based approach sound reasonable?
>What does this approach leave to be desired?
>Are efforts underway in this direction?
>Wouldn't it be a better world if we all used XML (-:
>
>I know that XML is currently being used for chemistry-related data (CML,
>see http://www.xml-cml.org/), but I haven't heard of any efforts in the
>area of Bioinformatics. So please view this message as targeted towards
>the Bioinformatics community that is not served by CML. (CML has a
>DNA/protein sequence tag.)
>
>        cheers,
>        gerald
>--
> Gerald Loeffler
> Email: Gerald.Loeffler@vienna.at
> Smail: Apollo Imaging, Marchettigasse 7, A-1060 Vienna, Austria
> Phone: +43 676 3289588 (+43 1 5952333 27)
> Fax:   +43 1 5952333 20
> Keywords: Java, CORBA, OOA&D, Databases, Bioinformatics,
>           Computational Biology, Computational Biophysics
-----
 //=\   Vicki Brown <vlb@deltagen.com>
 \=//    Journeyman Sourcerer: Scripts & Philtres
  //=\    (Mac)Perl, awk, sed, *sh..., occasional C
  \=//     A little web-gardening on the weekends
   //=\
   \=//      Deltagen, Inc.
    //=\     1031 Bing St, San Carlos, CA 94070