[Dynamite] compile status / radical idea

Sun, 16 Apr 2000 14:07:53 -0700 (PDT)

Ewan,

Thanks for your considered mail. First off, I think one theme that is
emerging is that we are all slightly uncomfortable with the current
paralysis, and that the constructive thing to do is to identify the
bottlenecks.

Three options are on the table: (1) keep going with IDL-to-C; (2) go Perl,
possibly mixed with C; (3) go pure C. I am continuing to argue for (2),
for reasons described below. I am also sympathetic to (3).

> i have been dreaming code last night, and I sort of realised that
> *internally* in the telegraph package we only "virtual
> function"/implementation flipping in a limited number of areas 
> 
> - getting sequences out of a database (but not sequences themselves)

I thought you were keen on sequences being virtual too. Do you regard this
as less crucial now that you've implemented virtual contigs for EnsEMBL?
(just curious really...)

> - running the algorithms either run-time or compile-time.
> 
> - perhaps some training code.
> 
> Everything else neither needs run-time method binding nor that much
> inheritance.
> 
> 
> So - rather than moving to Perl (drawbacks in my book -
> 
> 	a) hard to maintain a large Perl code base - look at ensembl

Actually, I don't think this *would* be a large Perl codebase. This
project is well-contained, and our object model is already laid out.
I think we could do it in a dozen or so smallish modules. Probably less
code than "idlstubs.pl". (And probably quicker to write.)

> 	b) execute heavy pieces going run like a stuck cow

Yes, but the Perl implementation is proof of concept only. We'd have two
options to improve performance:

	(1) port DP routines to C
	(2) autogenerate C (cf. original Dynamite) - VERY easy using Perl

> 	c) guy wont do anything

;-)

I had hoped that Guy would be interested in converting parts of the
package from Perl to C. The DP algorithms, for example.

What makes this idea so attractive to me is quick publication. Let me
elaborate, then you can shoot me down if you disagree...

I/we can write Telegraph in Perl *very quickly*. We are talking about a
matter of days here. OK, so it runs slow, but we have proof of concept of
everything - the whole object model, the idea of polymer HMMs, the
parameter space translation, the training code. _Everything_.

We then start to port parts of it to C, using the same object model as for
the Perl. (The original Perl version must be so object-oriented that it
has a halo.) We can even mix Perl & C initially, using XS. We can aim to
eventually implement the entire library in standalone C, or just the DP
algorithms, or whatever is feasible. There is no shame in leaving the
training algorithms in Perl, because the training code can be decoupled
from the DP code very easily. It is entirely feasible for the training and
the DP code to communicate by means of XS calls, or over sockets, or even
through temporary files: the only object that passes from the DP phase to
the training phase is a Param::Value::Buf, which is easily serialisable.
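
To make the temporary-file option concrete, here is a minimal sketch in C
of writing a parameter buffer to a file on the DP side and reading it back
on the training side. The ParamBuf struct and the parambuf_write /
parambuf_read names are made up for illustration; it assumes the buffer is
just a flat array of doubles, which may well not match the real
Param::Value::Buf.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for Param::Value::Buf: a flat array of
 * parameter values. The real layout is an assumption here. */
typedef struct {
    size_t  n;      /* number of parameters */
    double *value;  /* n values */
} ParamBuf;

/* Write the buffer as text: the count, then one value per line.
 * Returns 0 on success, -1 on failure. */
int parambuf_write(const ParamBuf *buf, const char *path)
{
    FILE *fp = fopen(path, "w");
    size_t i;

    if (fp == NULL)
        return -1;
    fprintf(fp, "%lu\n", (unsigned long) buf->n);
    for (i = 0; i < buf->n; i++)
        fprintf(fp, "%.17g\n", buf->value[i]);  /* 17 digits round-trips a double */
    return fclose(fp) == 0 ? 0 : -1;
}

/* Read a buffer written by parambuf_write. On success the caller
 * owns buf->value and must free() it. Returns 0 or -1. */
int parambuf_read(ParamBuf *buf, const char *path)
{
    FILE *fp = fopen(path, "r");
    unsigned long n;
    size_t i;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%lu", &n) != 1) {
        fclose(fp);
        return -1;
    }
    buf->n = (size_t) n;
    buf->value = malloc(n ? n * sizeof(double) : 1);
    if (buf->value == NULL) {
        fclose(fp);
        return -1;
    }
    for (i = 0; i < buf->n; i++) {
        if (fscanf(fp, "%lf", &buf->value[i]) != 1) {
            free(buf->value);
            buf->value = NULL;
            fclose(fp);
            return -1;
        }
    }
    fclose(fp);
    return 0;
}
```

A text format like this is slow but trivially readable from Perl as well,
which is the point: either side of the DP/training boundary can be in
either language.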

We can work in parallel. No bottlenecks, and we can write a paper at any
stage, because we have 100% proof of concept: a working Perl program. I
hypothesise that a useful division of labour would be for me to do the
initial Perl implementation, perhaps with Ewan. Then Ewan and Guy could
take over the porting to C, while I could either write more Perl (e.g.
experimental training code, XML I/O) or help with the C port.

As soon as we publish, we can go all-out Open Source, i.e. publicise the
mailing list, give away bottles of champagne, etc etc. Perhaps people will
even help us with the C conversion.

Being able to publish early, even just a poster at ISMB, is *very*
*attractive*. It will really get the ball rolling; a collaboration with a
publication to its name is a collaboration that has come of age.

> 
> I suggest - 
> 
> 	Using "Standard" C methods, with some pointer-to-funtion for
> database streaming/database access, algorithm implementation to allow
> compile time code coming in cleanly and possibly training interface.
> 
> I have a clean sequence stuff already with pointer-to-function for
> database streaming. I can bind these via CORBA to bioperl.
> 
> 
> What do people think?

I'm not completely sure I follow you. Are you proposing abandoning our IDL
object model but sticking with C?

If so, then I guess this would certainly remove the IDL-to-C bottleneck
that has arguably contributed to our current paralysis. We would be
throwing out a few babies with the bathwater though...

	(Baby #1) Yes we are only making sparing use of inheritance and
		  dynamic binding, but IMO the main advantage of
		  "object-oriented C" is having a logical object model,
		  making the library nicer & more logical to use.
	          Our IDL-to-C mapping enforces this.

	(Baby #2) The formality of using an IDL-to-C mapping also provides
		  for future scenarios such as interfacing to CORBA or
		  Perl XS.
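
For concreteness, the kind of pointer-to-function binding under
discussion, done by hand rather than generated from IDL, might look
roughly like the sketch below. All the names (SeqDB, ArrayState,
seqdb_init_array, seqdb_count) are hypothetical; this is neither Ewan's
existing sequence code nor idlstubs output, just an illustration of the
style.

```c
#include <stddef.h>

/* Hypothetical interface: a sequence database whose implementation
 * is chosen at run time through a function pointer. */
typedef struct SeqDB SeqDB;
struct SeqDB {
    /* Return the next sequence, or NULL when the stream is exhausted. */
    const char *(*next_seq)(SeqDB *self);
    void *state;  /* implementation-private data */
};

/* One possible implementation: sequences held in an in-memory array. */
typedef struct {
    const char **seqs;
    size_t n;
    size_t pos;
} ArrayState;

static const char *array_next_seq(SeqDB *self)
{
    ArrayState *st = (ArrayState *) self->state;
    return st->pos < st->n ? st->seqs[st->pos++] : NULL;
}

/* Bind the array implementation to a SeqDB "object". */
void seqdb_init_array(SeqDB *db, ArrayState *st,
                      const char **seqs, size_t n)
{
    st->seqs = seqs;
    st->n = n;
    st->pos = 0;
    db->next_seq = array_next_seq;
    db->state = st;
}

/* Client code sees only the interface, never the implementation. */
size_t seqdb_count(SeqDB *db)
{
    size_t count = 0;
    while (db->next_seq(db) != NULL)
        count++;
    return count;
}
```

A flat-file or CORBA-backed database would supply its own next_seq and
state, and seqdb_count would not change. What the hand-written version
loses is any enforcement that every "class" follows the same conventions.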

I have no interest in pushing idlstubs if you are both uncomfortable using
it. I have always been concerned that using an in-house compiler would
give people the willies, especially if it is opaque to everyone except
me.

Most of my recent work on idlstubs has been aimed at making it more
comprehensible, by separating out the C-generating part from the IDL
parser. With these improvements, it would be straightforward for you guys
to edit the C without having to delve into the idlstubs Perl.

I estimate the new improved idlstubs would be ready by the end of the
month, unless we abandon IDL-to-C in which case I won't work on it.

On balance, I think the bottleneck problem probably outweighs the
advantages of IDL-to-C. But I'd like to see a little more discussion on
this list first.

I still favour Perl, because I see this being the quickest way by far of
getting a working library. Dissuade me...

Ian