[Dynamite] new test xml: protein-smith-waterman.xml

Guy Slater guy@ebi.ac.uk
Mon, 31 Jul 2000 20:59:15 +0100 (BST)


On Fri, 28 Jul 2000, Ian Holmes wrote:

> On Fri, 28 Jul 2000, Guy Slater wrote:
> 
> > 
> > I've started telegraph coding in earnest, but I won't be checking
> > stuff in until I have automake/autoconf working properly
> > (probably monday).  Automake/autoconf are a pain to get working
> > well across platforms, but I think in the long run it will help
> > portability and save a lot of hassle.
> > 
> > Anyway, I've just checked in an initial attempt for another
> > test telegraph model: protein-smith-waterman.xml
> > 
> > You can see it on cvs web in the test directory:
> > 
> > http://dev.ensembl.org/cgi-bin/cvsweb_telegraph/cvsweb.cgi/telegraph/Telegraph/test/xml/
> 
> The web view doesn't seem to be working. is this because sshd was just
> upgraded on adnah?

I don't think it's connected to sshd, but I've asked James to fix it.
In the meantime, just check it out, I guess.
 
> > I'm pretty unsure about a lot of this, so if you
> > could both have a look over this it would be good.
> > 
> > Some questions and comments:
> > ---------------------------
> > 
> >     o (telegraph == Moore) && (dynamite == Mealy) ??
> >       (or vice-versa - I can't remember)
> >       The advance and param tags are with the transitions,
> >       but the way I've put them, they repeat on a per-state basis,
> >       which seems pretty pointless and verbose.
> >       How should this bit be done properly ?
> >       What are the pros and cons of Moore vs Mealy ?
> >       (I thought we'd discussed this but couldn't find it in the archive)
> > 
> >     o Is the way I've done gap_open and gap_extend correct ?
> >       Why are these vectors not scalars ?  It looks really silly.
> 
> As regards both these points:
> It is not meant to be hand-generated XML (except for these
> low-level tests), nor is it meant to look pretty. Do not worry if it looks
> wasteful. Remember that we plan to have a higher-level XML (and object 
> model!) at some point to abstract some of these things. In particular
> 'calc' expressions (i.e. scalar gap_open etc); don't try and anticipate
> these too early

That isn't really why I don't like it.

I don't think there is much point in it being so low-level
that I have to write code below it to bring the level back up again
(i.e. to check whether all the elements are the same for optimisations etc.)
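
To make the objection concrete, here is a minimal sketch of the kind of
"bring the level back up" check I mean. It is illustrative only: the
function name collapse_uniform and the 20-entry gap_extend vector are
assumptions for the example, not anything in the telegraph code.

```python
from typing import Optional, Sequence


def collapse_uniform(values: Sequence[int]) -> Optional[int]:
    """Return the shared value if every entry is identical, else None.

    A consumer of the low-level XML would need a check like this to
    rediscover that a vector-valued parameter is really a scalar.
    """
    if not values:
        return None
    first = values[0]
    return first if all(v == first for v in values) else None


# Hypothetical data: one identical gap_extend entry per residue.
gap_extend = [12] * 20
scalar_gap_extend = collapse_uniform(gap_extend)
```

If the result is not None, an optimised scalar code path could be used;
otherwise the general per-element path is needed.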
 
> Moore vs Mealy: entirely interconvertible, but since there are more
> transitions than states, you have more degrees of freedom our way.
> 
> >     o I still don't like the tag name "scores" being used in the
> >       parameter assignments.  It is only marginally less vague than
> >       using "numbers" or "data".  Alternatives ?  "populate" ?
> 
> yes i agree. "populate" is good. i had thought of "calc" but i prefer
> "populate"
> 
> Chris says he thinks we should also have a more XML-like list format for
> within the "assign" blocks. e.g.
> 
>   <populate table="gap_extend">           [note "param" --> "table"; see below]
>    <x>12</x> <x>12</x> <x>12</x> ...
>   </populate>
> 
> it looks awful but we _do_ need to be able to write a DTD for this XML and
> i'm not sure the comma-separated list can be DTD'd. if anyone can come up
> with a better way....

Yes, the blosum62 matrix in protein-smith-waterman.xml is going to look
nasty like this.  I think if we want to write a DTD, this can just be CDATA.

Even doing this won't convey the LSF ordering (or whatever it was).
Similarly, the char ordering in <table> is important, and I'm not
sure if or how we write a DTD for that.

I reckon we should probably just leave this one for the moment,
and worry about it when we come to write DTDs.
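
For what it's worth, a reader can accept both list styles cheaply, so the
choice need not block anything. The sketch below is a rough illustration
using Python's standard xml.etree.ElementTree; the function name
read_populate and the three-value sample data are assumptions, and only
the tag names "populate", "x" and the "table" attribute come from the
proposal above.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def read_populate(xml_text: str) -> Tuple[Optional[str], List[int]]:
    """Parse one <populate> block in either proposed list style.

    Returns (table name, values). Values keep document order, which is
    the only place the ordering is recorded in either style.
    """
    node = ET.fromstring(xml_text)
    items = node.findall("x")
    if items:
        # Element style: <x>12</x> <x>12</x> ...
        values = [int(item.text) for item in items]
    else:
        # Text style: a comma-separated list as character data.
        text = node.text or ""
        values = [int(tok) for tok in text.split(",") if tok.strip()]
    return node.get("table"), values


# Hypothetical sample data, shortened to three entries.
element_form = '<populate table="gap_extend"><x>12</x> <x>12</x> <x>12</x></populate>'
text_form = '<populate table="gap_extend">12, 12, 12</populate>'
```

Both sample strings parse to the same table name and value list, so the
two styles carry the same information; neither says anything about what
the ordering means.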

> i think we should definitely not regard the tagnames as set in stone yet
> (so don't embed them in your code (this should go without saying ;-)))

Um. Hahaha - yes of course it is all pretty table-driven code.
Maybe we should start code review at some point ;)
 
> >     o Similarly, I don't like the use of char and chars.
> >       Are we limiting alphabet sizes to 256 ?
> >       Maybe 'character' or 'symbol' ?
> 
> "symbol" would be good, i think.
> if we want to not restrict the alphabet size we should change the
> declaration syntax to e.g.
> 
>  <alphabet name="protein">
>   <symbol>A</symbol>
>   <symbol>R</symbol>
>   <symbol>N</symbol>
> ...
>  </alphabet>

Yes, I think this would be a good idea.
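
As a rough sketch of how that declaration might be consumed (Python's
standard xml.etree.ElementTree; the function name symbol_index is an
assumption, and the sample keeps only the three symbols shown above):

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple

# Sample limited to the three symbols quoted above; the rest is elided.
ALPHABET_XML = """
<alphabet name="protein">
  <symbol>A</symbol>
  <symbol>R</symbol>
  <symbol>N</symbol>
</alphabet>
"""


def symbol_index(xml_text: str) -> Tuple[Optional[str], Dict[str, int]]:
    """Map each symbol to its position in declaration order.

    Symbols are arbitrary strings, so nothing here limits the alphabet
    to 256 single characters. Duplicate symbols are rejected.
    """
    node = ET.fromstring(xml_text)
    index: Dict[str, int] = {}
    for position, sym in enumerate(node.findall("symbol")):
        token = (sym.text or "").strip()
        if token in index:
            raise ValueError("duplicate symbol: " + token)
        index[token] = position
    return node.get("name"), index


name, index = symbol_index(ALPHABET_XML)
```

Because the index follows declaration order, the ordering that matters
for table lookups is carried by the element order itself.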
 
> other than that i think it's good.
> 
> here are some other tagname changes i think would be good:
> 
> (at top level) "index" --> "table"
> (within "populate") "param" --> "table"
> (within "transition") "<param name=''>" --> "<index table=''>"    [see below]
> 
> ("table" is a Haskell-ish name for a multidimensional array)
> 
> also i would like to change the transition block around a little. this is
> the only non-cosmetic change. what i would like is to separate out the
> lookback from the table indexing. this means specifying it (the lookback)
> twice, but i think it will avoid confusion in the long run, especially
> when we start to use polymer HMMs.

... it's not avoiding confusion in the short run ;)
Can you give an example of when they should be different ?
Otherwise I think we should avoid redundancy in this XML.
 
> i have committed some changes into dna-edit.xml rather than spell them all
> out in detail here (we can always rewind) --- please tell me what you
> think

I think doing that instantly breaks a lot of code and tests.
The xml in CVS head should be stuff we have agreed on.
I'd rather you just put proposed changes in XML comments or something.

I'm OK with the cosmetic changes,
but not so sure about the offset/step stuff.

Hmmm.  This reads like a moody email.  Does that make it XP ? ;)

Guy.
--
%!PS % <------ Guy St.C. Slater ------> http://www.ebi.ac.uk/~guy/  <------
210 297/a{def}def/b{translate}a b 36/c{rotate}a c 0 1 0 1 12/d{exch moveto}
a/e{closepath stroke}a/f{index}a/g{0 0 0 0 4 f}a/h{setlinewidth newpath dup
g}a{pop exch 1 f add 0 h neg d lineto 72 c lineto e 2 h d 3 f 0 108 arc d e
18 c 0 2 f neg b 18 c}for 72 c newpath add g 0 7 arc d e pop showpage