Bioperl: CORBA vs XML

Ewan Birney birney@sanger.ac.uk
Wed, 12 May 1999 18:55:28 +0100 (BST)



This email is just to lay out my ideas about CORBA and XML. CORBA and
XML are both touted in bioinformatics as being solutions to the
integration problems in this field, from the "domain wide integrated
database" to "if only I could convert format XXX to format YYY then I
would be able to do this real cool analysis". It is pretty long in
retrospect, but at least for me was worth writing (I hope it is worth
reading for you). I am looking forward to the feedback which I hope this
generates.


CORBA vs XML
------------


First off, let's make it clear that neither technology provides a 'magic
bullet' for interoperability, just as using C++ doesn't guarantee code
reuse. They are just tools to use in this area which are more
effective than recreating everything from scratch.

Here is a thumbnail sketch of the two technologies:

CORBA is a collection of standards that provide interoperation,
generally between processes, often running on different machines. A
CORBA object is defined using Interface Definition Language (IDL),
which has an object-oriented syntax and defines only the methods one
can call on a particular object. To use CORBA an Object Request Broker
(ORB) is required; ORBs are available from both commercial and free
software sources. The IDL can be compiled for use with many different
languages or, for dynamic languages (eg Python), can be discovered at
run time.


XML is a tidying up of the SGML standards for text markup. ASCII text
data can have inserted tags which describe the data (the most famous
SGML type is HTML). An XML stream carries within it the tags and often
this is enough to indicate what the data means: however a Document
Type Definition (DTD) language allows one to define what tags are
valid and how they can be used together. There are parsers for XML and
DTD available in a number of languages.
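
As a rough illustration of the XML half of this sketch, here is a minimal
Python example using the standard library's xml.etree.ElementTree. The
element and attribute names (sequence, id, type, description, residues) are
invented for the example and are not taken from BIOML or any real DTD. Note
that ElementTree only checks that the text is well-formed; it does not
validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every tag and attribute name here is made up
# for illustration and does not follow any published DTD.
SEQUENCE_XML = """
<sequence id="demo1" type="dna">
  <description>An invented example record</description>
  <residues>ATGGCCTAA</residues>
</sequence>
"""


def read_sequence(xml_text):
    """Pull the fields out of the hypothetical <sequence> element."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "type": root.get("type"),
        "description": root.findtext("description"),
        "residues": root.findtext("residues"),
    }


record = read_sequence(SEQUENCE_XML)
print(record["id"], record["type"], len(record["residues"]))
```

The tags alone tell a reader roughly what each value is, but nothing in the
stream says what may be done with the data: that is left to the client.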



There is also a big psychological difference between the two approaches.


CORBA is a programmer heavy, industry heavy set of standards, which
very definitely aims to solve problems "the right way". This generally
means a lot of complexity up front which, coupled with a heavy reliance
on programming experience, makes CORBA difficult to understand for the
'hacker/power user' (and even for the seasoned programmer). However, in
general the CORBA standards are a good way to solve some pretty thorny
problems and the fact that the difficult issues are addressed early on
makes for good long term stability. CORBA's 'best fit' is to Java, and
Java2 will come with its own ORB, making Java/CORBA applications even
more seamless.


XML is very web driven: XML builds on the success of the HTTP
protocol for its transport, and XML is an inherently accessible
data source, being easy to 'read and understand', so people can get
stuff done quickly. XML's culture is close to that of Perl and the Web
in terms of how standards are made. XML is just a data definition:
actual methods are made by coupling routines to a web server, and thus
many of the difficult problems that CORBA tackles head-on are passed
directly on to the implementors without any help. In *most* cases,
these difficult problems need not be solved well for thrown-together
systems, but you would not want to run a subsystem of a fighter
aircraft on a cookie-based system with an HTTP server (whereas that is
the kind of job CORBA does want to do).



Both technologies have a clear role to play in a number of different
areas, most of which don't overlap: for example, CORBA is being used for
communication in the Gnome and KDE desktops developed for Linux/Unix
(being the open standards equivalent of DCOM/COM). XML (or rather
SGML) has had a long history in technical documentation and corpora
of text for dictionaries/linguistic analysis. These are clearly
distinct problems. Similarly in bioinformatics there are a series of
problems to which CORBA and XML provide solutions: if you wanted to
expose the internals of a sequencing robot in terms of its control,
CORBA would be a good choice. Conversely the biological literature is
crying out for XML. The problem is that the main thing in this field
that we care about (Sequences and Sequence related information) is not
clearly a problem for one technology or the other. And there is the rub.


So - Here is the CORBA case for Sequence objects:
-------------------------------------------------

o CORBA exposes methods not data, which is what you want, as many sequence
objects might be constructed 'virtually' from other data sources, and you
need a method only interface to them.

o Declaring methods allows different views on the same object
to be kept in sync and valid.

o Transport (of a particular item) is efficient, with the potential
for well optimized marshalling/demarshalling.

o IDL provides strong typing rules to enforce that only 'sensible' things
can be done with the objects

o CORBA tackles head-on the problem of object identity.
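
The first point can be sketched in Python, with an abstract base class
standing in for an IDL interface. This is only an analogy: no ORB is
involved, and the class and method names (Sequence, StoredSequence,
VirtualSubsequence, identifier, length, subseq) are hypothetical, not the
bioperl design or any LSR definition.

```python
from abc import ABC, abstractmethod


class Sequence(ABC):
    """Method-only view of a sequence, in the spirit of an IDL interface."""

    @abstractmethod
    def identifier(self):
        """Return a name for this sequence."""

    @abstractmethod
    def length(self):
        """Return the number of residues."""

    @abstractmethod
    def subseq(self, start, end):
        """Return residues start..end (1-based, inclusive)."""


class StoredSequence(Sequence):
    """A sequence that really holds its residues."""

    def __init__(self, identifier, residues):
        self._identifier = identifier
        self._residues = residues

    def identifier(self):
        return self._identifier

    def length(self):
        return len(self._residues)

    def subseq(self, start, end):
        return self._residues[start - 1:end]


class VirtualSubsequence(Sequence):
    """A sequence constructed 'virtually' as a window onto another one."""

    def __init__(self, parent, start, end):
        self._parent = parent
        self._start = start
        self._end = end

    def identifier(self):
        return f"{self._parent.identifier()}:{self._start}-{self._end}"

    def length(self):
        return self._end - self._start + 1

    def subseq(self, start, end):
        # Translate window coordinates into parent coordinates.
        offset = self._start - 1
        return self._parent.subseq(start + offset, end + offset)


genomic = StoredSequence("clone1", "AAACCCGGGTTT")
window = VirtualSubsequence(genomic, 4, 9)
print(window.identifier(), window.subseq(1, window.length()))
```

A client written against Sequence cannot tell whether the residues are
stored or computed, which is the point of exposing methods rather than data.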


Here is the anti-CORBA case:
---------------------------

o CORBA is an inaccessible technology for the majority of programmers
in this field (learning curve is too steep, no good Perl ORB - yet).

o Construction of servers in CORBA is difficult and time consuming

o IDL definitions place too many restrictions on how to use the data,
and in general people tend not to expose things which they should,
making the objects pretty useless.

o Transport of the whole object requires visiting each method in turn,
which ends up being a big hit. (Objects by value partially solve this,
but no one has implemented it, and if you are serious about using
methods on objects, this doesn't solve it).
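
The cost in the last point can be illustrated with a toy simulation. The
CountingRemote class below is hypothetical and talks to no real ORB or web
server; it just counts one round trip per call, so that copying an object
method by method can be compared with fetching it as a single document.

```python
class CountingRemote:
    """Stand-in for a remote object: every call costs one round trip."""

    def __init__(self, data):
        self._data = data
        self.round_trips = 0

    def call(self, method_name):
        # One simulated network round trip per method invoked.
        self.round_trips += 1
        return self._data[method_name]

    def fetch_document(self):
        # One simulated round trip returns everything at once.
        self.round_trips += 1
        return dict(self._data)


def copy_by_methods(remote, method_names):
    """Rebuild the whole object locally by visiting each method in turn."""
    return {name: remote.call(name) for name in method_names}


# Invented example data.
FIELDS = {
    "identifier": "demo1",
    "description": "invented record",
    "length": 12,
    "residues": "AAACCCGGGTTT",
}

by_methods = CountingRemote(FIELDS)
local_copy = copy_by_methods(by_methods, list(FIELDS))

by_document = CountingRemote(FIELDS)
document = by_document.fetch_document()

print(by_methods.round_trips, by_document.round_trips)  # prints "4 1"
```

The number of round trips grows with the number of methods visited, whereas
the single-document fetch stays at one however large the object is.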


Here is the XML case:
---------------------

o XML is a natural step up from HTML making the learning curve 
very shallow

o Servers are based (generally) around http and well established
and extendable servers (apache) are there for the using.

o XML contains all the data which the server wants to publish, and
in general in XML, the tendency is to put everything into the data
stream, letting the client decide what it wants to do with it.

o A single request gets the entire object in one go. 


Here is the anti XML case:
--------------------------

o XML parsing is an inherently expensive marshalling/demarshalling
process.

o The tendency with XML is not to send the data as the underlying
stored object, but as a sanitized view. A good example of this is
CDS, exons and introns on DNA sequences, which in XML would tend
to be written out explicitly, despite the fact that they are interlinked.

It is very easy when sanitized views of data are given to have valid XML
which describes invalid objects by other criteria (intron overlaps exon).
Conversely, sending only data which can be used unambiguously is generally
way too low level for the receiving client: to make sense of the data it
needs methods.

Basically, for complex objects either the client or the server has to
implement methods, and XML does not help the designers to implement
these methods. If the methods are implemented before the XML generation
(which will often be the case), then the XML is generally looser than
most people would like, which calls for paranoid clients that recheck
the XML data sent to them against other criteria.

o In XML large objects either need specialist servers to navigate over
them sensibly with the client (in which case building these servers
becomes a pain, and you are back to a methods definition problem which
XML doesn't provide any solutions for) or the *whole object* is just
sent to the client.

There is a lot of information attached to a DNA sequence (in particular
genomic sequence) so this is a clear issue.

o XML ducks the object identity problem (how can I get this object again?),
leaving it up to implementors to solve case-by-case.
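
The "valid XML, invalid object" point (intron overlaps exon) can be shown
with a small Python sketch of what a paranoid client might do. The markup is
invented for the example (gene, exon, intron, start, end are hypothetical
names, not BIOML), and the check is only one of many a real client would
need.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: well-formed XML, but the intron starts at 90
# while the first exon runs to 100, so the features overlap.
GENE_XML = """
<gene id="demo_gene">
  <exon start="1" end="100"/>
  <intron start="90" end="200"/>
  <exon start="201" end="300"/>
</gene>
"""


def overlapping_features(xml_text):
    """Report features that start before an earlier feature has ended.

    Returns a list of (earlier_tag, later_tag, later_start) tuples.
    """
    root = ET.fromstring(xml_text.strip())
    features = sorted(
        (int(el.get("start")), int(el.get("end")), el.tag) for el in root
    )
    problems = []
    furthest = None  # earlier feature that reaches furthest to the right
    for start, end, tag in features:
        if furthest is not None and start <= furthest[1]:
            problems.append((furthest[2], tag, start))
        if furthest is None or end > furthest[1]:
            furthest = (start, end, tag)
    return problems


problems = overlapping_features(GENE_XML)
print(problems)
```

The parser accepts the document without complaint; only the extra check,
which is a method in all but name, notices that the object makes no sense.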
  


Where does this leave us?
-------------------------


   I can't see CORBA being the dominant inter-group distributed
technology.  The learning curve is too steep, and in this field we
need more people who are biologically on the ball to be involved. If
we are serious in this field about somehow sharing our results as more
than published literature or HTML pages, then it is going to be by
XML.

  There are huge pitfalls to XML as it just ignores a number of
difficult issues that will be there. I can't see XML
'interoperability' being a sensible solution in a large project (eg
database/analysis pipeline).  In this kind of problem, CORBA gives
more support to the programmers for the hard problems and much better
performance/flexibility in the solution.

  So take home message:

  XML for internet/inter-group/data orientated use, but it is loose and
the problems in using it haven't been fully realised.

  CORBA for LAN-wide or 1-3 group interoperability for use by
serious programmers solving tricky problems.



What is bioperl's role in this?
------------------------------

  Bioperl's role in my view is to work with as many things out there
which are useful as possible. This means both XML and CORBA. In XML,
as people on this list know, BIOML is a good stab at defining a
sensible sequence object (http://www.proteometrics.com/BIOML/). In
CORBA, the EBI have provided CORBA access to their database
(http://corba.ebi.ac.uk/). There is a domain task force (basically a
bunch of companies plus the EBI) who are defining a more industry wide
set of standards in CORBA for this field (the LSR group).

The 'heavyweight' sequence object that we are designing at the moment in
bioperl will make use of both of these resources so that it can
read/write BIOML (or another standard) and be loaded up from the CORBA
server at the EBI. (BTW - if you are interested in the design of this
object, pop over to the guts list to see our first stab at the
design).

  The other role for bioperl is to be the talking shop for XML definitions.
A lot of people are using XML but somehow, despite the fact that they
want to make it general, are not discussing it openly (at least not when
I am around!). I think this mailing list is a great place to discuss
DTD definitions and thrash out what data we want to share between groups.
It is important for the XML people out there to start talking about what
they are planning.
  
  I am sure NCBI have a big role to play in the XML area because their
core technology (ASN.1) is very close to XML. It will be great to get
their input on these things (I am sure they have encountered many problems
which should be avoided in the future). The best standards come about
by cycling between discussion, testing and serious use. Does anyone know
who should be contacted at NCBI about this? Are there any NCBI lurkers on
the list who want to chip in?


  I would be really interested in anyone else's view on this (I am sure
I don't have the best view on it) and would like to encourage people,
in particular the people involved in XML efforts to air their ideas.



  Apologies for the length of this email. ;)


ewan


  



Ewan Birney
<birney@sanger.ac.uk>
http://www.sanger.ac.uk/Users/birney/
