[Dynamite] new test xml: protein-smith-waterman.xml

Mon, 31 Jul 2000 13:17:33 -0700 (PDT)

On Mon, 31 Jul 2000, Guy Slater wrote:

> > >     o (telegraph == Moore) && (dynamite == Mealy) ??
> > >       (or vice-versa - I can't remember)
> > >       The advance and param tags are with the transitions,
> > >       but the way I've put them, they repeat on a per-state basis,
> > >       which seems pretty pointless and verbose.
> > >       How should this bit be done properly ?
> > >       What are the pros and cons of Moore vs Mealy ?
> > >       (I thought we'd discussed this but couldn't find it in the archive)
> > > 
> > >     o Is the way I've done gap_open and gap_extend correct ?
> > >       Why are these vectors not scalars ?  It looks really silly.
> > 
> > As regards both these points:
> > It is not meant to be hand-generated XML (except for these
> > low-level tests), nor is it meant to look pretty. Do not worry if it looks
> > wasteful. Remember that we plan to have a higher-level XML (and object 
> > model!) at some point to abstract some of these things. In particular
> > 'calc' expressions (i.e. scalar gap_open etc); don't try and anticipate
> > these too early
> 
> That isn't really why I don't like it.
> 
> I don't think there is much point in it being so low-level
> that I have to write code below it to bring the level back up again
> (ie. to check if all the elements are the same for optimisations etc.)

At some point we have to decide which optimisations we are just going to
kiss goodbye, or at least postpone implementing. But if you think losing
this particular optimisation would be unacceptable, there is room to
incorporate it using the modified transition syntax I described later in
my mail (see below)

>  
> > Moore vs Mealy: entirely interconvertible, but since there are more
> > transitions than states, you have more degrees of freedom our way.
> > 
> > >     o I still don't like the tag name "scores" being used in the
> > >       parameter assignments.  It is only marginally less vague than
> > >       using "numbers" or "data".  Alternatives ?  "populate" ?
> > 
> > yes i agree. "populate" is good. i had thought of "calc" but i prefer
> > "populate"
> > 
> > Chris says he thinks we should also have a more XML-like list format for
> > within the "assign" blocks. e.g.
> > 
> >   <populate table="gap_extend">           [note "param" --> "table"; see below]
> >    <x>12</x> <x>12</x> <x>12</x> ...
> >   </populate>
> > 
> > it looks awful but we _do_ need to be able to write a DTD for this XML and
> > i'm not sure the comma-separated list can be DTD'd. if anyone can come up
> > with a better way....
> 
> Yes, the blosum62 matrix in protein-smith-waterman.xml is going to look
> nasty like this.  I think if we want to write a DTD, this can just be CDATA.
> 
> Even doing this won't convey the LSF ordering (or whatever it was).
> Similarly, the char ordering in <table> is important, and I'm not
> sure if or how we write a DTD for that.
> 
> I reckon we should probably just leave this one for the moment,
> and worry about it when we come to write DTDs.

OK

> 
> > i think we should definitely not regard the tagnames as set in stone yet
> > (so don't embed them in your code (this should go without saying ;-)))
> 
> Um. Hahaha - yes of course it is all pretty table-driven code.
> Maybe we start code review at some point ;)

well, it's all up to you how you implement, but just be aware that we are
probably going to have to refactor quite a few changes before this thing
is out (i know you don't like the word "refactor" ;) )

>  
> > >     o Similarly, I don't like the use of char and chars.
> > >       Are we limiting alphabet sizes to 256 ?
> > >       Maybe 'character' or 'symbol' ?
> > 
> > "symbol" would be good, i think.
> > if we want to not restrict the alphabet size we should change the
> > declaration syntax to e.g.
> > 
> >  <alphabet name="protein">
> >   <symbol>A</symbol>
> >   <symbol>R</symbol>
> >   <symbol>N</symbol>
> > ...
> >  </alphabet>
> 
> Yes, I think this would be a good idea.

good

>  
> > other than that i think it's good.
> > 
> > here are some other tagname changes i think would be good:
> > 
> > (at top level) "index" --> "table"
> > (within "populate") "param" --> "table"
> > (within "transition") "<param name=''>" --> "<index table=''>"    [see below]
> > 
> > ("table" is a Haskell-ish name for a multidimensional array)
> > 
> > also i would like to change the transition block around a little. this is
> > the only non-cosmetic change. what i would like is to separate out the
> > lookback from the table indexing. this means specifying it (the lookback)
> > twice, but i think it will avoid confusion in the long run, especially
> > when we start to use polymer HMMs.
> 
> ... it's not avoiding confusion in the short run ;)
> Can you give an example of when they should be different ?
> Otherwise I think we should avoid redundancy in this XML.

This addresses exactly the issue that you raised before (too many elements
in e.g. a simple gap penalty)

do you see what i mean? either:

 (1) you always specify the maximum number of elements in any score
     matrix, with a fixed indexing order (e.g. query-target)... OR

 (2) you allow more compact score matrices, and you specifically tell the
     engine which things are used to index the matrix

so, the easiest example is your one about having to copy out the gap
penalty 20 times for protein-SW (with my proposed change, you'd just leave
out the '<symbol chain="query" offset="0"/>' tag from the <index> block
within <transition>)
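
To make the gap penalty case concrete, here is a rough sketch in Python
(purely for illustration; the names lookup, index_spec and context are
hypothetical and are not taken from the engine, and the value 12 is just
the one from the gap_extend example quoted above). It assumes a table is a
mapping from index tuples to scores, and that the index block lists which
context fields are used to build the tuple:

```python
# Hypothetical sketch: none of these names come from the actual engine.

def lookup(table, index_spec, context):
    """Index `table` using only the context fields named in `index_spec`."""
    key = tuple(context[field] for field in index_spec)
    return table[key]

alphabet = "ARNDCQEGHILKMFPSTWYV"  # 20 protein symbols

# Option (1): fixed indexing order, so the penalty is copied out per symbol.
gap_open_full = {(sym,): 12 for sym in alphabet}

# Option (2): nothing listed in the index block, so one entry is enough.
gap_open_compact = {(): 12}

context = {"query": "R", "target": "N"}
full = lookup(gap_open_full, ["query"], context)
compact = lookup(gap_open_compact, [], context)
```

Both lookups give the same score; the second table holds one entry rather
than 20, and the engine is told explicitly that no symbol is needed to
index it.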

another example would be if you had a polymer HMM, and you wanted to use
the same emission spectrum for each insert state. Rather than copy out the
same emission spectrum for every unit of the HMM, you just write it once
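
The polymer HMM case could look something like this (again a Python sketch
with hypothetical names; the emission values are made up for illustration
and the context layout is an assumption). The unit number is available in
the context, but because it is not named in the index, every insert state
reads the same table:

```python
# Hypothetical sketch; the emission values are invented for illustration.

def emit_score(table, index_spec, context):
    """Index `table` using only the context fields named in `index_spec`."""
    key = tuple(context[field] for field in index_spec)
    return table[key]

# One emission spectrum, written once.
emission = {("A",): 0.4, ("C",): 0.1, ("G",): 0.1, ("T",): 0.4}

# The index names the emitted symbol only, not the HMM unit.
index_spec = ["symbol"]

# Insert states of three units all share the single table above.
scores = [
    emit_score(emission, index_spec, {"unit": unit, "symbol": "A"})
    for unit in range(3)
]
```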

So there are two direct benefits: compactness of XML (basically
irrelevant) and a slight optimisation gain (because you don't need to do
unnecessary index lookups)

I also think it's more elegant, because there is less "hidden" information
about the way the engine constructs tuples with which to index tables.

We should talk by phone about this if you're unsure

>  
> > i have committed some changes into dna-edit.xml rather than spell them all
> > out in detail here (we can always rewind) --- please tell me what you
> > think
> 
> I think doing that instantly breaks a lot of code and tests.
> The xml in CVS head should be stuff we have agreed on.
> I'd rather you just put proposed changes to XML comments or something.

OK, sorry.

Does it break code/tests? I was under the impression you hadn't written
any code yet. Although I still take the point; the XML should be regarded
as sacrosanct, since messing with it fucks people around...

Do you want me to rewind? (i'm actually not sure how to, but i'm sure i
can figure it out)

> 
> I'm OK with the cosmetic changes,
> but not so sure about the offset/step stuff.

let's talk by phone.
i do take your point about changing the xml. i won't do that again without
consulting.

> 
> Hmmm.  This reads like a moody email.  Does that make it XP ? ;)

probably... (best Bruce Lee voice) "give me emotional content, not anger!"

ian

> 
> Guy.
> --
> %!PS % <------ Guy St.C. Slater ------> http://www.ebi.ac.uk/~guy/  <------
> 210 297/a{def}def/b{translate}a b 36/c{rotate}a c 0 1 0 1 12/d{exch moveto}
> a/e{closepath stroke}a/f{index}a/g{0 0 0 0 4 f}a/h{setlinewidth newpath dup
> g}a{pop exch 1 f add 0 h neg d lineto 72 c lineto e 2 h d 3 f 0 108 arc d e
> 18 c 0 2 f neg b 18 c}for 72 c newpath add g 0 7 arc d e pop showpage
> 
> 
>