[Biojava-dev] Current Alphabet design and an unintended
consequence for training
mark schreiber
markjschreiber at hotmail.com
Sun Jan 4 08:30:48 EST 2004
David,
I think you may be able to use solution 1. As Symbol is a superclass
(interface) of AtomicSymbol you won't be breaking any API.
- Mark
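
A minimal sketch of why widening the parameter type keeps existing call sites compiling. The types below are hypothetical stand-ins that only borrow names from the thread, not the real BioJava interfaces, and the Distribution argument is left out for brevity:

```java
import java.util.HashMap;
import java.util.Map;

class WideningDemo {
    static double run() {
        CountingTrainer trainer = new CountingTrainer();
        AtomicSymbol cytosine = new SimpleAtomicSymbol("cytosine");
        // Call sites written against the old AtomicSymbol signature
        // compile unchanged, because an AtomicSymbol is also a Symbol.
        trainer.addCount(cytosine, 1.0);
        trainer.addCount(cytosine, 0.5);
        return trainer.count("cytosine");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

// Hypothetical stand-ins for the types named in the thread.
interface Symbol { String getName(); }
interface AtomicSymbol extends Symbol { }

final class SimpleAtomicSymbol implements AtomicSymbol {
    private final String name;
    SimpleAtomicSymbol(String name) { this.name = name; }
    @Override public String getName() { return name; }
}

// Simplified trainer interface (assumption: no Distribution argument).
interface DistributionTrainer {
    void addCount(Symbol sym, double times); // previously: AtomicSymbol sym
}

final class CountingTrainer implements DistributionTrainer {
    private final Map<String, Double> counts = new HashMap<>();

    @Override public void addCount(Symbol sym, double times) {
        // Accumulate the count under the symbol's name.
        counts.merge(sym.getName(), times, Double::sum);
    }

    double count(String name) { return counts.getOrDefault(name, 0.0); }
}
```

Note that this only covers callers. Any class that implements the interface would have to change its own method signature to match, so third-party trainers, if there are any, would need a small update.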
> -----Original Message-----
> From: biojava-dev-bounces at portal.open-bio.org
> [mailto:biojava-dev-bounces at portal.open-bio.org] On Behalf Of
> David Huen
> Sent: Saturday, 3 January 2004 10:33 p.m.
> To: biojava-dev at biojava.org
> Subject: [Biojava-dev] Current Alphabet design and an
> unintended consequence for training
>
> The current Alphabet system uses BasisSymbol to represent
> both ambiguity symbols and symbols from cross-product
> alphabets. This causes unintended consequences for training
> algorithms.
>
> The DistributionTrainerContext has an addCount(Distribution
> dist, Symbol sym, double times) method. When using cross
> product alphabets, it works flawlessly when it encounters
> AtomicSymbols from the cross-product alphabet (and these are
> also BasisSymbols). In the design of the DistributionTrainer
> interface, the equivalent method addCount(Distribution dist,
> AtomicSymbol sym, double times) accepts only an AtomicSymbol,
> which is reasonable.
>
> However, when training two-head distributions, it is not
> implausible for DistributionTrainerContext.addCount() to
> receive Symbols that are not AtomicSymbols. By far the most
> common would be symbols emitted by gap states, of the form
> e.g. (gap cytosine). The current implementation of the
> addCount method assumes that non-atomic symbols are ambiguity
> symbols and attempts to deal with them in that manner.
> Evidently this fails in the above case; indeed, it fails
> silently. This problem currently prevents the training of
> PairDistributions in which one component Distribution is a
> GapDistribution.
>
> There appears to be no easy way of fixing this problem at the
> level of DistributionTrainerContext. It is formally possible
> that the BasisSymbol received by addCount is truly an
> ambiguity symbol containing a number of symbols from the
> cross-product alphabet of the two-head HMM. It is also
> possible that the BasisSymbol represents a single symbol
> comprising ambiguity symbol(s) from one or both of the
> alphabets that form the cross-product alphabet. The two are
> evidently not equivalent and have to be dealt with
> differently, and working out which case applies is
> potentially computationally costly for an operation that is
> repeated very many times during training.
>
> Even if this ambiguity could be resolved at the level of
> DistributionTrainerContext and you knew the symbol to be of
> the form (gap <something else>), that symbol could not be
> passed to a DistributionTrainer that may be capable of
> dealing with it, because the addCount method in that
> interface accepts only atomic symbols, which something like
> (gap guanine) is not.
>
> Interim solutions could be:
> 1) change DistributionTrainer.addCount() to accept
> non-atomic symbols.
> DistributionTrainerContext's addCount method will leave it to
> the distribution trainers to sort out what to do with
> non-atomic symbols themselves.
> OR
> 2) add an ExtendedDistributionTrainer interface with a single
> addCount method that can accept non-atomic symbols.
> DistributionTrainerContext's addCount method will check
> whether the symbol it receives is atomic. If it is, it will
> use the standard DistributionTrainer.addCount(). If not, it
> will determine if the trainer for that distribution
> implements the ExtendedDistributionTrainer interface and if
> so, call that interface's addCount method to leave it to deal
> with the symbol. If not, it will assume that the symbol is
> an ambiguity symbol and deal with it in the manner it does now.
>
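
The dispatch described in option 2 could be sketched roughly as follows. Everything here is a hypothetical stand-in that borrows names from the thread: the real BioJava signatures also take a Distribution, `getMatches()` returning a List is an assumption, and the even split used in the fallback is only a placeholder for whatever the existing ambiguity handling actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class Option2Demo {
    static final AtomicSymbol ADENINE = new Atom("adenine");
    static final AtomicSymbol GUANINE = new Atom("guanine");

    // A (gap cytosine)-style symbol sent to a trainer that opted in.
    static List<String> routeGapSymbol() {
        GapAwareTrainer trainer = new GapAwareTrainer();
        Symbol gapC = new NonAtomic("(gap cytosine)", Collections.<AtomicSymbol>emptyList());
        TrainerContext.addCount(trainer, gapC, 1.0);
        return trainer.log;
    }

    // An ambiguity symbol sent to a trainer that did not opt in.
    static List<String> routeAmbiguity() {
        PlainTrainer trainer = new PlainTrainer();
        Symbol purine = new NonAtomic("purine", Arrays.asList(ADENINE, GUANINE));
        TrainerContext.addCount(trainer, purine, 1.0);
        return trainer.log;
    }

    public static void main(String[] args) {
        System.out.println(routeGapSymbol());
        System.out.println(routeAmbiguity());
    }
}

interface Symbol { String getName(); List<AtomicSymbol> getMatches(); }
interface AtomicSymbol extends Symbol { }

final class Atom implements AtomicSymbol {
    private final String name;
    Atom(String name) { this.name = name; }
    public String getName() { return name; }
    public List<AtomicSymbol> getMatches() { return Collections.<AtomicSymbol>singletonList(this); }
}

final class NonAtomic implements Symbol {
    private final String name;
    private final List<AtomicSymbol> matches;
    NonAtomic(String name, List<AtomicSymbol> matches) { this.name = name; this.matches = matches; }
    public String getName() { return name; }
    public List<AtomicSymbol> getMatches() { return matches; }
}

interface DistributionTrainer { void addCount(AtomicSymbol sym, double times); }
interface ExtendedDistributionTrainer { void addCount(Symbol sym, double times); }

final class TrainerContext {
    static void addCount(DistributionTrainer trainer, Symbol sym, double times) {
        if (sym instanceof AtomicSymbol) {
            // Atomic symbols take the standard route.
            trainer.addCount((AtomicSymbol) sym, times);
        } else if (trainer instanceof ExtendedDistributionTrainer) {
            // The trainer has opted in to handling non-atomic symbols itself.
            ((ExtendedDistributionTrainer) trainer).addCount(sym, times);
        } else {
            // Fallback: treat as an ambiguity symbol (placeholder: even split).
            List<AtomicSymbol> matches = sym.getMatches();
            for (AtomicSymbol match : matches) {
                trainer.addCount(match, times / matches.size());
            }
        }
    }
}

final class PlainTrainer implements DistributionTrainer {
    final List<String> log = new ArrayList<>();
    public void addCount(AtomicSymbol sym, double times) { log.add("atomic:" + sym.getName()); }
}

final class GapAwareTrainer implements DistributionTrainer, ExtendedDistributionTrainer {
    final List<String> log = new ArrayList<>();
    public void addCount(AtomicSymbol sym, double times) { log.add("atomic:" + sym.getName()); }
    public void addCount(Symbol sym, double times) { log.add("extended:" + sym.getName()); }
}
```

In this sketch the opted-in trainer receives (gap cytosine) intact, while the plain trainer only ever sees the atomic matches of the ambiguity symbol, so existing trainers are left untouched.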
> (2) is probably less disruptive to existing code and
> interfaces. The DistributionTrainer may also be a better
> place to deal with non-atomic symbols than
> DistributionTrainerContext, since the DT knows more about
> the internals of its Distribution and what it can or should
> handle, while the DTC must of necessity implement a
> one-size-fits-all approach.
>
> For Biojava 2, it may be worthwhile to revisit the Alphabet
> design and explicitly distinguish ambiguity symbols from
> BasisSymbols, in that the former is a Set of Symbols while
> the latter is a List of Symbols.
>
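
To make the Set-versus-List distinction concrete, here is a small illustration. The class names `AmbiguitySymbol` and `OrderedBasisSymbol` are hypothetical, and plain strings stand in for Symbols. An ambiguity symbol is an unordered collection of alternatives, whereas a cross-product symbol has one ordered slot per component alphabet:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SetVersusListDemo {
    // {adenine, guanine} and {guanine, adenine} are the same ambiguity symbol.
    static boolean ambiguityIgnoresOrder() {
        return new AmbiguitySymbol("adenine", "guanine")
                .equals(new AmbiguitySymbol("guanine", "adenine"));
    }

    // (gap cytosine) and (cytosine gap) are different cross-product symbols.
    static boolean basisRespectsOrder() {
        return !new OrderedBasisSymbol("gap", "cytosine")
                .equals(new OrderedBasisSymbol("cytosine", "gap"));
    }

    public static void main(String[] args) {
        System.out.println(ambiguityIgnoresOrder());
        System.out.println(basisRespectsOrder());
    }
}

// Hypothetical: an ambiguity symbol as a Set of alternatives.
final class AmbiguitySymbol {
    private final Set<String> alternatives;

    AmbiguitySymbol(String... names) {
        this.alternatives = new HashSet<>(Arrays.asList(names));
    }

    @Override public boolean equals(Object other) {
        return other instanceof AmbiguitySymbol
                && alternatives.equals(((AmbiguitySymbol) other).alternatives);
    }

    @Override public int hashCode() { return alternatives.hashCode(); }
}

// Hypothetical: a cross-product symbol as a List, one entry per alphabet.
final class OrderedBasisSymbol {
    private final List<String> components;

    OrderedBasisSymbol(String... names) {
        this.components = Arrays.asList(names.clone());
    }

    @Override public boolean equals(Object other) {
        return other instanceof OrderedBasisSymbol
                && components.equals(((OrderedBasisSymbol) other).components);
    }

    @Override public int hashCode() { return components.hashCode(); }
}
```

With the two kinds of symbol as distinct types, a trainer context could tell them apart with a cheap type check instead of inspecting the contents of each symbol.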
> Regards,
> David Huen
>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at biojava.org
> http://biojava.org/mailman/listinfo/biojava-dev
>