[Biojava-l] adding counts to Dists
Thomas Down
td2@sanger.ac.uk
Thu, 13 Dec 2001 11:14:54 +0000
On Thu, Dec 13, 2001 at 10:33:52PM +1300, Mark Schreiber wrote:
> Hi -
>
> When adding a large number of counts to a Distribution via a trainer I
> have found it is much quicker to store the counts in an array (indexed by
> the AlphabetIndex for that alphabet). Increment the counts as each symbol
> comes in and then add the counts to the trainer at the end. (followed by
> the .train() method).
>
> I'm curious as to why this is. I assume it's because the trainer checks the
> validity of each symbol, although technically so does the AlphabetIndex by
> looking up the index for the symbol.
>
> Not that this is a major issue it might just be a way to speed up
> distribution training
Do you know which implementation of Distribution you're
using? SimpleDistribution uses a fairly sensible DistributionTrainer
object (which uses an Indexed and an array -- pretty much the
same as you are). However, I notice that there's also something
called SimpleDistributionTrainer. This is storing counts in
a Map<Symbol, Double>, and I suspect is likely to be /much/
less efficient -- especially as there's object churn every time
a new count is added.
If the distribution you're using is still using a SimpleDistributionTrainer,
I'd guess that could cause some fairly dreadful performance.
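
To illustrate the difference, here is a rough sketch in plain Java of the
two ways of storing counts. It does not use the BioJava classes at all:
CountStores, indexOf, countWithArray and countWithMap are made-up names,
and the four-symbol switch is only a stand-in for what an AlphabetIndex
does. The array version updates a primitive slot in place, while the Map
version has to box a fresh Double on every update, which is the kind of
object churn I'd expect to hurt.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: these names are hypothetical, not BioJava API.
final class CountStores {

    // Stand-in for an alphabet index: maps each symbol to a fixed slot
    // and rejects symbols that are not in the alphabet.
    static int indexOf(char symbol) {
        switch (symbol) {
            case 'a': return 0;
            case 'c': return 1;
            case 'g': return 2;
            case 't': return 3;
            default:
                throw new IllegalArgumentException("Unknown symbol: " + symbol);
        }
    }

    // Array-backed store: one primitive double per symbol, updated in place,
    // so no objects are allocated while counting.
    static double[] countWithArray(String seq) {
        double[] counts = new double[4];
        for (int i = 0; i < seq.length(); i++) {
            counts[indexOf(seq.charAt(i))] += 1.0;
        }
        return counts;
    }

    // Map-backed store: each update looks up the old value, then puts back
    // a newly boxed Double (the char key is boxed to a Character as well).
    static Map<Character, Double> countWithMap(String seq) {
        Map<Character, Double> counts = new HashMap<>();
        for (int i = 0; i < seq.length(); i++) {
            char symbol = seq.charAt(i);
            indexOf(symbol); // same validity check as the array version
            Double old = counts.get(symbol);
            counts.put(symbol, old == null ? 1.0 : old + 1.0);
        }
        return counts;
    }

    public static void main(String[] args) {
        String seq = "acgtacga";
        double[] byArray = countWithArray(seq);
        Map<Character, Double> byMap = countWithMap(seq);
        System.out.println("array count for 'a': " + byArray[indexOf('a')]);
        System.out.println("map count for 'a':   " + byMap.get('a'));
    }
}
```

Both methods give the same counts; the only difference is how much work
each increment costs. Accumulating in the array and handing the totals to
the trainer once at the end, as you describe, avoids the per-count cost
whichever trainer sits behind the distribution.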
Thomas.