[Biojava-dev] Current Alphabet design and an unintended consequence for training

Sun Jan 4 08:30:48 EST 2004

David,

I think you may be able to use solution 1. Since Symbol is a
superinterface of AtomicSymbol, widening the parameter type won't
break any existing callers.
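
A minimal sketch of that point, assuming simplified stand-in interfaces
rather than the real BioJava declarations (the Distribution argument is
omitted, and SimpleAtomic and CountingTrainer are hypothetical names).
It only shows that a call site passing an AtomicSymbol still compiles
once the parameter is widened to Symbol:

```java
import java.util.HashMap;
import java.util.Map;

public class WidenedAddCountSketch {
    // Simplified stand-ins for the BioJava types (assumption: real ones have more methods).
    interface Symbol { String getName(); }
    interface AtomicSymbol extends Symbol { }

    // Hypothetical trainer interface after solution 1: parameter widened to Symbol.
    interface DistributionTrainer {
        void addCount(Symbol sym, double times);
    }

    // Hypothetical atomic symbol used only for this illustration.
    static final class SimpleAtomic implements AtomicSymbol {
        private final String name;
        SimpleAtomic(String name) { this.name = name; }
        public String getName() { return name; }
    }

    // Hypothetical trainer that tallies counts by symbol name.
    static final class CountingTrainer implements DistributionTrainer {
        final Map<String, Double> counts = new HashMap<>();
        public void addCount(Symbol sym, double times) {
            counts.merge(sym.getName(), times, Double::sum);
        }
    }

    static double demo() {
        CountingTrainer trainer = new CountingTrainer();
        AtomicSymbol c = new SimpleAtomic("cytosine");
        // Call sites written against the AtomicSymbol signature are unchanged.
        trainer.addCount(c, 1.0);
        trainer.addCount(c, 2.0);
        return trainer.counts.get("cytosine");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Classes that implement the trainer interface would still need their own
addCount signature updated to match.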

- Mark

> -----Original Message-----
> From: biojava-dev-bounces at portal.open-bio.org 
> [mailto:biojava-dev-bounces at portal.open-bio.org] On Behalf Of 
> David Huen
> Sent: Saturday, 3 January 2004 10:33 p.m.
> To: biojava-dev at biojava.org
> Subject: [Biojava-dev] Current Alphabet design and an 
> unintended consequence for training
> 
> The current Alphabet system uses the BasisSymbol to represent 
> both ambiguity symbols and symbols from cross product 
> alphabets.  This causes unintended consequences for training 
> algorithms.
> 
> The DistributionTrainerContext has an addCount(Distribution 
> dist, Symbol sym, double times) method.  When using cross 
> product alphabets, it works flawlessly when it encounters 
> AtomicSymbols from the cross-product alphabet (and these are 
> also BasisSymbols).  In the design of the DistributionTrainer 
> interface, the equivalent method addCount(Distribution dist, 
> AtomicSymbol sym, double times) accepts only an AtomicSymbol, 
> which is reasonable.
> 
> However, when training two-head distributions, it is not 
> implausible for
> DistributionTrainerContext.addCount() to receive Symbols that 
> are not AtomicSymbols.  The most common by far would be 
> symbols emitted by gap states, of the form (gap cytosine), 
> for example.  The current implementation of the addCount 
> method assumes that non-atomic symbols are ambiguity symbols 
> and attempts to deal with them in that manner.  Evidently it 
> fails in the above case; indeed, it fails silently.  This 
> problem currently prevents the training of PairDistributions 
> in which one component Distribution is a GapDistribution.
> 
> There appears to be no easy way of fixing this problem at the 
> level of DistributionTrainerContext.  It is formally possible 
> that the BasisSymbol received by addCount is truly an 
> ambiguity symbol containing a number of symbols from the 
> cross-product alphabet of the two-head HMM model.  It is also 
> possible that the BasisSymbol represents a single symbol 
> comprising ambiguity symbol(s) from one or both alphabets 
> that form the cross product alphabet.  The two are evidently 
> not equivalent and have to be dealt with differently.  And 
> resolving which case applies is potentially computationally 
> costly for an operation that is repeated very many times 
> during training.
> 
> Even if this ambiguity could be resolved at the level of 
> DistributionTrainerContext and you knew the symbol to be one 
> of type (gap <something else>), that symbol could not be 
> passed to a DistributionTrainer capable of dealing with it, 
> because the addCount method in that interface accepts only 
> atomic symbols, which something like (gap guanine) is not.
> 
> Interim solutions could be:
> 1) change DistributionTrainer.addCount() to accept 
> non-atomic symbols.  
> DistributionTrainerContext's addCount method will leave it to 
> the distribution trainers to sort out what to do with 
> non-atomic symbols themselves.  
> OR
> 2) add an ExtendedDistributionTrainer interface with one 
> method addCount that can accept non-atomic symbols.  
> DistributionTrainerContext's addCount method will check 
> whether the symbol it receives is atomic.  If it is, it will 
> use the standard DistributionTrainer.addCount().  If not, it 
> will determine if the trainer for that distribution 
> implements the ExtendedDistributionTrainer interface and if 
> so, call that interface's addCount method to leave it to deal 
> with the symbol.  If not, it will assume that the symbol is 
> an ambiguity symbol and deal with it in the manner it does now.
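
The dispatch described in option 2 could be sketched roughly as follows.
This assumes simplified stand-in interfaces with the Distribution
argument omitted; Atom, GapPair, PlainTrainer, GapAwareTrainer and the
dispatch method are hypothetical names, and the existing ambiguity
handling is only marked by a placeholder rather than reproduced:

```java
import java.util.ArrayList;
import java.util.List;

public class ExtendedTrainerDispatchSketch {
    // Simplified stand-ins; the real BioJava interfaces carry more methods.
    interface Symbol { }
    interface AtomicSymbol extends Symbol { }

    interface DistributionTrainer {
        void addCount(AtomicSymbol sym, double times);
    }

    // Hypothetical interface from option 2: one method accepting any Symbol.
    interface ExtendedDistributionTrainer {
        void addCount(Symbol sym, double times);
    }

    static final class Atom implements AtomicSymbol { }   // e.g. cytosine
    static final class GapPair implements Symbol { }      // e.g. (gap cytosine)

    static final class PlainTrainer implements DistributionTrainer {
        public void addCount(AtomicSymbol sym, double times) { }
    }

    static final class GapAwareTrainer
            implements DistributionTrainer, ExtendedDistributionTrainer {
        public void addCount(AtomicSymbol sym, double times) { }
        public void addCount(Symbol sym, double times) { }
    }

    // Returns the route taken so the three branches can be observed.
    static String dispatch(DistributionTrainer trainer, Symbol sym, double times) {
        if (sym instanceof AtomicSymbol) {
            trainer.addCount((AtomicSymbol) sym, times);
            return "atomic";
        }
        if (trainer instanceof ExtendedDistributionTrainer) {
            // The trainer decides what to do with the non-atomic symbol.
            ((ExtendedDistributionTrainer) trainer).addCount(sym, times);
            return "extended";
        }
        // Placeholder for the current behaviour: treat as an ambiguity symbol.
        return "ambiguity";
    }

    static List<String> demo() {
        List<String> routes = new ArrayList<>();
        routes.add(dispatch(new PlainTrainer(), new Atom(), 1.0));
        routes.add(dispatch(new GapAwareTrainer(), new GapPair(), 1.0));
        routes.add(dispatch(new PlainTrainer(), new GapPair(), 1.0));
        return routes;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The extra cost per call is at most two instanceof checks, and trainers
that do not implement the extended interface keep their current
behaviour.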
> 
> (2) is probably less disruptive to existing code and 
> interfaces.  It may be that the DistributionTrainer is a 
> better place to deal with non-atomic symbols than 
> DistributionTrainerContext, since the DT knows more about 
> the internals of that Distribution and what it can/should 
> handle, while the DTC must of necessity implement a 
> one-size-fits-all approach.
> 
> For BioJava 2, it may be worthwhile to revisit the Alphabet 
> design and explicitly distinguish ambiguity symbols from 
> BasisSymbols, in that the former is a Set of symbols while 
> the latter is a List of Symbols.
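
One way to picture that Set-versus-List distinction, as a sketch only:
the Ambiguity and Tuple classes below are hypothetical and are not part
of any BioJava API. An ambiguity symbol is an unordered collection of
alternatives, whereas a cross-product symbol is an ordered tuple with one
component per alphabet:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SetVersusListSketch {
    interface Symbol { }

    // Hypothetical atomic symbol; instances are compared by identity.
    static final class Atom implements Symbol {
        final String name;
        Atom(String name) { this.name = name; }
    }

    // Hypothetical ambiguity symbol: an unordered set of alternatives.
    static final class Ambiguity implements Symbol {
        final Set<Symbol> matches;
        Ambiguity(Symbol... syms) { matches = new HashSet<>(Arrays.asList(syms)); }
    }

    // Hypothetical cross-product symbol: an ordered list of components.
    static final class Tuple implements Symbol {
        final List<Symbol> components;
        Tuple(Symbol... syms) { components = new ArrayList<>(Arrays.asList(syms)); }
    }

    static final Atom GAP = new Atom("gap");
    static final Atom A = new Atom("adenine");
    static final Atom G = new Atom("guanine");
    static final Atom C = new Atom("cytosine");

    // {A, G} and {G, A} describe the same ambiguity.
    static boolean ambiguityIgnoresOrder() {
        return new Ambiguity(A, G).matches.equals(new Ambiguity(G, A).matches);
    }

    // (gap cytosine) and (cytosine gap) are different cross-product symbols.
    static boolean tupleRespectsOrder() {
        return !new Tuple(GAP, C).components.equals(new Tuple(C, GAP).components);
    }

    public static void main(String[] args) {
        System.out.println(ambiguityIgnoresOrder());
        System.out.println(tupleRespectsOrder());
    }
}
```

With separate types, a trainer could tell the two cases apart by type
alone instead of inspecting the symbol's contents.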
> 
> Regards,
> David Huen