[Biojava-dev] Distributions Gaps and Residual counts
mark.schreiber at group.novartis.com
Mon Mar 1 22:05:25 EST 2004
Hi -
I have modified (hacked might be a better word) AbstractDistribution to
take advantage of a weakness in the setWeight() method of Distribution so
that Distributions can hold the weight of Gaps. This follows up on a tip
for which Thomas Down deserves the credit.
When you set the weights of Symbols using the setWeight() method there is
no contract that says the weights have to add up to one. The weights can
sum to less than one, which means some residual weight is not assigned to
any Symbol. For this reason it is recommended that Distributions are
trained using a DistributionTrainerContext, which avoids this behaviour.
You wouldn't notice this residual weight unless you tried to sample from
the Distribution, in which case you could get an exception if it tried to
return a Symbol that doesn't exist.
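
A minimal sketch of the residual-weight problem, in plain Java rather than
the BioJava API. The class ResidualWeightDemo and its methods are
hypothetical names made up for illustration, and Symbols are modelled as
Strings. It assumes sampling walks the cumulative weights; BioJava's actual
sampling code may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a distribution; NOT BioJava's Distribution.
class ResidualWeightDemo {

    // Sum of the weights that were explicitly assigned.
    static double assigned(Map<String, Double> weights) {
        double sum = 0.0;
        for (double w : weights.values()) {
            sum += w;
        }
        return sum;
    }

    // Weight left over when the assigned weights add up to less than one.
    static double residual(Map<String, Double> weights) {
        return Math.max(0.0, 1.0 - assigned(weights));
    }

    // Cumulative-weight sampling for a given p in [0, 1).
    // Returns null when p falls in the residual region, i.e. when
    // there is no symbol to return.
    static String sample(Map<String, Double> weights, double p) {
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            p -= e.getValue();
            if (p < 0.0) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Double> w = new LinkedHashMap<>();
        w.put("a", 0.25);
        w.put("c", 0.25);
        w.put("g", 0.25); // "t" is never set, so 0.25 is unassigned
        System.out.println("residual    = " + residual(w));
        System.out.println("sample(0.1) = " + sample(w, 0.1));
        System.out.println("sample(0.9) = " + sample(w, 0.9));
    }
}
```

A draw that lands in the unassigned quarter has no Symbol to return, which
is the situation where the exception described above could show up.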
I have changed AbstractDistribution so that any residual weight is
assigned to the Gap symbol. You can get the weight of the gap Symbol with
the getWeight() method. You don't really need to set it, as you can just
set the weights of the other Symbols and leave some room for the gap as
residual weight. You cannot train gaps in, because training ultimately
requires Symbols to be reduced to AtomicSymbols. It is not possible to make
Gap atomic without changing half of the Symbol and Distribution APIs, which
seemed pretty tiresome. I would vote for gaps being atomic in any redesign
of biojava. I generally don't recommend you play around with residual
weight, but I post it here to inform any keen developer of the possibility.
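
The idea of handing the residual weight to the gap can be sketched roughly
as follows. This is not the AbstractDistribution code: GapResidualSketch is
a hypothetical class, Symbols are plain Strings, and "-" is an assumed
placeholder for the gap symbol. Rejecting a direct setWeight() on the gap is
a simplification chosen for this sketch only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the described behaviour; NOT AbstractDistribution.
class GapResidualSketch {
    static final String GAP = "-"; // assumed placeholder for the gap symbol

    private final Map<String, Double> weights = new HashMap<>();

    // Set the weight of an ordinary symbol. In this sketch the gap weight
    // is never stored; it is always derived from what is left over.
    void setWeight(String symbol, double w) {
        if (GAP.equals(symbol)) {
            throw new IllegalArgumentException("gap weight is implied by the residual");
        }
        if (w < 0.0) {
            throw new IllegalArgumentException("weight must not be negative");
        }
        weights.put(symbol, w);
    }

    // Ordinary symbols return their stored weight (0.0 if never set);
    // the gap returns whatever weight the other symbols left unassigned.
    double getWeight(String symbol) {
        if (GAP.equals(symbol)) {
            double sum = 0.0;
            for (double w : weights.values()) {
                sum += w;
            }
            return Math.max(0.0, 1.0 - sum);
        }
        return weights.getOrDefault(symbol, 0.0);
    }

    public static void main(String[] args) {
        GapResidualSketch d = new GapResidualSketch();
        d.setWeight("a", 0.3);
        d.setWeight("c", 0.3);
        d.setWeight("g", 0.2);
        d.setWeight("t", 0.1);
        // Roughly 0.1 is left over, so that is what the gap reports.
        System.out.println("gap weight = " + d.getWeight(GAP));
    }
}
```

With this arrangement the weights, gap included, account for the whole
distribution as long as the other Symbols sum to no more than one.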
Anyhow, the reason for this hack is that it allows the DistributionTools
method distOverAlignment() to keep track of the frequency of gaps at any
position in an Alignment. This is probably the only recommended way of
producing a Distribution with gap weight. The change probably makes
setWeight() slightly safer to use, and if you see gaps in your sample()s
then you have probably messed up setting the weights in your Distribution.
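
To illustrate how residual weight can track gap frequency per alignment
column, here is a rough sketch that does not use distOverAlignment() or any
BioJava class. ColumnGapSketch and its methods are hypothetical, the
alignment is an array of equal-length Strings, and '-' is assumed to mark a
gap.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; NOT DistributionTools.distOverAlignment().
class ColumnGapSketch {

    // One weight map per column. Each non-gap residue adds 1/rows to its
    // symbol; gap characters are skipped, so their share stays unassigned.
    // Assumes all rows have the same length.
    static List<Map<Character, Double>> columnWeights(String[] rows) {
        List<Map<Character, Double>> columns = new ArrayList<>();
        if (rows.length == 0) {
            return columns;
        }
        for (int col = 0; col < rows[0].length(); col++) {
            Map<Character, Double> weights = new LinkedHashMap<>();
            for (String row : rows) {
                char c = row.charAt(col);
                if (c != '-') {
                    weights.merge(c, 1.0 / rows.length, Double::sum);
                }
            }
            columns.add(weights);
        }
        return columns;
    }

    // The gap frequency of a column is the residual weight.
    static double gapWeight(Map<Character, Double> column) {
        double sum = 0.0;
        for (double w : column.values()) {
            sum += w;
        }
        return Math.max(0.0, 1.0 - sum);
    }

    public static void main(String[] args) {
        String[] alignment = { "AC-T", "A--T", "ACGT", "A-GT" };
        List<Map<Character, Double>> columns = columnWeights(alignment);
        for (int i = 0; i < columns.size(); i++) {
            System.out.println("column " + i + ": " + columns.get(i)
                    + " gap=" + gapWeight(columns.get(i)));
        }
    }
}
```

In this toy alignment the two middle columns are half gaps, so half of
their weight is left as residual, while the outer columns have none.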
Let me know if this change causes any unexpected strangeness.
Mark Schreiber
Principal Scientist (Bioinformatics)
Novartis Institute for Tropical Diseases (NITD)
1 Science Park Road
#04-14 The Capricorn, Science Park II
Singapore 117528
phone +65 6722 2973
fax +65 6722 2910