[BioSQL-l] Re: Postgres version support

22 Oct 2002 15:33:05 -0600

"Chris Mungall" <cjm@fruitfly.org> writes:

Let me be clear up front. The whole reason I created the XML approach
was to limit the expressivenes of generic SQL - it was *too*
expressive, both in syntax and semantics.

I needed to constrain what possibilities I could use to make it easy
to generate meta-information about a repository.

I didn't do the XML format because it gave me a better DB, I did it
because it enabled me to auto-generate a whole suite of tools based on
the schema. 

Personally, I have too many open source modules to maintain. Who wants
to go and hand-edit a gizzilion Perl modules every time our data model
needs to add some relationships? 

That's a phenomenal waste of my time. I'd rather be writing cool new
tools. 

> [*] However, I'd really like to add a generator for your MAGE-schema-xml
> to the aforementioned CPAN parsing module (Actually I don't think the
> module is on CPAN yet, but it will be soon - I'm cc-ing the developers
> list). This will allow you to slurp any SQL schema out there into MAGE
> land, including bioSQL. I think the results should be pretty cool.

Great! Let me know how I can help.

(Further thought provoking ideas inline)

> Personally I prefer an SQL syntax for hand-generating, editing and
> browsing relational schemas. It's a bit harder to parse (not too
> hard with Parse::RecDescent), but that's taken care of with a
> standard CPAN module [*].

Right, the issue is limiting yourself to a given syntax and sticking
to it. The SQL parser you're talking about (sqlfairy?) presumably
implements partial SQL handling - so as long as you stick to what it
can handle you're fine.

Likewise, I wanted to identify only those bits of SQL DDL that were
really critical for me:
* columns
* data types
* primary keys
* unique indices
* foreign keys
* views

The view support is pretty minimal currently, because that's all I
needed. 

> I think your can have the SQL syntax be fairly DB independent, so
> long as you stick to SQL 92. There's always weird datatype issues
> like mysql mediumtext but then switching to XML doesn't solve this.

Probably. Using the XML format forces it to be true.

I think there is also the issue of representation. When people see
SQL, they expect it to behave like SQL. When people see a table
definition in XML they expect it to be a table definition. I'm not
sure this is an important issue, but I believe the separation of
implementation from syntax is a good one.

> As well as supplying an alternate syntax to SQL CREATE TABLE
> statements, your DTD also seems to be offering a slightly different
> semantics to the relational model. This makes perfect sense for the
> genex object-centric approach. You want a more object-y kind of
> semantics, including inheritance, and you are less concerned with a
> lot of the relational semantics you lose out on.

Not sure I lose out on any semantics. I am sure that there are
semantics that I haven't bother to implement, but that doesn't mean
they cannot be expressed, or are you think of something in particular?

The inheritance is not a central feature, it's something we started
playing with recently, and it's not all that well supported.

*The* critical feature is cardinality support.

> However, I'm not sure I understand your point about cardinality -
> you can specify any kind of cardinality with the relational model
> using foreign key and uniqueness constraints. In fact with check
> constraints you can do a lot more powerful stuff too, such as
> specifying "table 'car' has a 1:4 relationship with table 'wheel'".

Yes, I'm sure it's possible to do that, I'm not saying that the XML
format lets you do something you cannot do with SQL, in fact the XML
format is fairly limiting (as you've pointed out). What it does do is
provide an explicit syntax to describe the intended cardinality, and
this is either tremendously much harder or impossible with a general
SQL parser.

Just seeing that table A has a foreign to table B tells me nothing of
the cardinality of that relationship. In order to implement an object
layer on top of this I have to know what the cardinality is. 

So in the XML format I specify it explicitly, 1-to-1, 1-to-many,
many-to-many (just to pick the three bigies). If you wanted you could
create new ones, and but in new code into xml2sql() to implement them
with CHECK constraints, but how many people really care? If someone
does, it will happen.

> Again, I personally prefer relational semantics, they are actually fairly
> rich and expressive once you step out of the MySQL world [**]. 

Exactly, it's *too* expressive.

> Personally I prefer bioSQL to ptake a more relation-centric
> approach, although there are plenty people contributing more commits
> than me who prefer a more object-centric approach (I think).

This is where we disagree.

I care a lot about the world of WWW services, so that means I don't
give crap about what your underlying schema looks like, I want your
data, and I want it now.

So I'm not going to be using BioSQL to move data around the 'net, I'm
going to be using objects serialized as XML. Those objects have to
have a formalized definition in an XML-schema (I'd love to use
RELAX-NG, but the tools aren't there yet, so it's got to be W3C XML
schemas for now).

And (this is where MAGEstk comes into play) if there's a higher level
UML definition to those objects, I can auto-generate a complete
architecture for your data:

* DB retrieval =>
* object layer =>
* XML serialization =>
* SOAP transport =>
* XML de-serialization =>
* object layer =>
* DB storage.

I don't have to write/maintain any of this by hand.

That seems pretty useful and powerful to me!

So, yes, the relational schemas I auto-generate in MAGEstk are
probably not anything compared to what you could do by hand -
*because* the schemas are not so critical to me, the framework is.

So, that's why we let people do it by hand, and write DB adaptors to
map from the DB to the object layer. One way of doing this is using
the XML format, and then I can still generate a low-level API that is
DB specific (currently that's all Genex has). 

Then you can write a mapping between the DB object layer (the Genex
API) and the higher level objects (the MAGE API), and then MAGEstk can
generate the DB adaptor code for you. This is the part of MAGEstk that
isn't finished, and we need to work on during the Nov programming
jamboree in the Bay Area.

> As an afterword - If we want to move to an even more expressive
> semantics than the relational model, then I think a description
> logic (DAML+OIL, or OWL) is the only way to go. I think Matt and
> Thomas are quite keen on this, I am too, but this is definitely much
> more research-y, as the tools just aren't there yet.

This is likely to be very fruitful. I'm just monumentally ignorant of
DAML+OIL as well as RDF/RDF schema. There's a grad student in Colorado
that wants to add support to the MAGE XMLWriter to output MAGE-ML in a
format that is more condusive to expression knowledge, read
relationships between the data, and joining related bits by RDF. 

Cheers,
jas.

PS. I'm thoroughly enjoying the discussion, thanks for getting me
thinking!