[Biojava-l] RE: Bug in HashedAlphabetIndex??
Schreiber, Mark
mark.schreiber@agresearch.co.nz
Thu, 8 Mar 2001 11:13:05 +1300
> -----Original Message-----
> From: Matthew Pocock [mailto:mrp@sanger.ac.uk]
> Sent: Thursday, March 08, 2001 5:00 AM
> To: Schreiber, Mark
> Cc: 'biojava-l@biojava.org'
> Subject: Re: [Biojava-l] RE: Bug in HashedAlphabetIndex??
>
>
> Hi Mark,
>
> I've fixed this on the main trunk. Thomas, could you port this to the
> 1.1 branch?
>
Great, seems to work now. What was the problem??
> The two issues this brings up are
>
> a) I think that the SymbolList FiniteAlphabet.symbols() is unnecessary.
> If you want to iterate over an alphabet or find its size, you can just
> use the methods in FiniteAlphabet. If you wish to impose some ordering
> on the FiniteAlphabet, then you can use an AlphabetIndexer object
> obtainable via AlphabetManager.getAlphabetIndex(alpha). I have
> deprecated symbols(), but I think it should remain un-deprecated on
> the release branch.
>
Ok with me.
> b) The default distribution objects construct a distribution with as
> many parameters as there are symbols in your alphabet, and one fewer
> free parameter (as they must sum to 1). I have a gut feeling that
> building a probability distribution over a very large number of symbols
> (e.g. > DNA hexamers) is probably silly (although I've now tested it for
> dna^7), as you won't have enough data. This may mandate the use of
> custom Distribution implementations that do clever data-smoothing, or
> that have far fewer parameters (e.g. make dna^7 using a function of a
> simple dna^2 matrix).
>
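To put rough numbers on point (b): a full distribution over dna^k has 4^k
symbols and therefore 4^k - 1 free parameters. Below is a minimal Java sketch
of that arithmetic. The class and method names are made up for illustration
and are not part of BioJava.

```java
// Hypothetical helper (not a BioJava class): counts the parameters of a
// full distribution over a cross-product alphabet such as dna^k.
class KmerParameterCount {

    // Number of symbols in base^order, e.g. 4^6 for DNA hexamers.
    static long symbolCount(int baseSize, int order) {
        long n = 1;
        for (int i = 0; i < order; i++) {
            n = Math.multiplyExact(n, (long) baseSize); // throws on overflow
        }
        return n;
    }

    // One parameter per symbol, minus one because the weights sum to 1.
    static long freeParameters(int baseSize, int order) {
        return symbolCount(baseSize, order) - 1;
    }

    public static void main(String[] args) {
        for (int k = 1; k <= 11; k++) {
            System.out.println("dna^" + k + ": " + symbolCount(4, k)
                    + " symbols, " + freeParameters(4, k) + " free parameters");
        }
    }
}
```

Hexamers already need 4095 free parameters and dna^7 needs 16383, which is
why the amount of training data becomes the limiting factor so quickly.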
True. Most genefinder programs use hexamers. Glimmer uses up to dna^9 but
implements an interpolation method to estimate these from lower-order data,
so this sort of thing would require a custom Distribution. I don't know if
anyone has proven a significant advantage in using anything above hexamers,
so I don't think this will be needed in the near future unless someone
wants to investigate the effect of the helical periodicity of DNA, for
which dna^10 or dna^11 may be needed. Without a custom Distribution, this
would probably require several large genomes to sample adequately.
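As a rough illustration of what such a custom, smoothed distribution might
look like, here is a small standalone Java sketch. It is not Glimmer's
interpolation scheme and it does not implement BioJava's Distribution
interface. It simply mixes the observed k-mer frequency with a background
built from single-base frequencies, using a fixed weight that grows with the
amount of training data. The class name, method names and the pseudoWeight
parameter are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: k-mer probabilities smoothed towards a product of
// single-base frequencies. Not Glimmer's method, not a BioJava class.
class SmoothedKmerDistribution {
    private static final String BASES = "ACGT";

    private final int k;
    private final double pseudoWeight; // how strongly to trust the background
    private final Map<String, Integer> kmerCounts = new HashMap<>();
    private final long[] baseCounts = new long[4];
    private long totalKmers = 0;
    private long totalBases = 0;

    SmoothedKmerDistribution(int k, double pseudoWeight) {
        if (k < 1 || pseudoWeight <= 0.0) {
            throw new IllegalArgumentException("need k >= 1 and pseudoWeight > 0");
        }
        this.k = k;
        this.pseudoWeight = pseudoWeight;
    }

    private static boolean isValid(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (BASES.indexOf(s.charAt(i)) < 0) return false;
        }
        return true;
    }

    // Count single bases and overlapping k-mers; windows with other letters are skipped.
    void train(String sequence) {
        String seq = sequence.toUpperCase();
        for (int i = 0; i < seq.length(); i++) {
            int idx = BASES.indexOf(seq.charAt(i));
            if (idx >= 0) {
                baseCounts[idx]++;
                totalBases++;
            }
        }
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            if (isValid(kmer)) {
                kmerCounts.merge(kmer, 1, Integer::sum);
                totalKmers++;
            }
        }
    }

    // lambda * observed frequency + (1 - lambda) * background frequency.
    double probability(String kmer) {
        String w = kmer.toUpperCase();
        if (w.length() != k || !isValid(w)) {
            throw new IllegalArgumentException("expected a DNA word of length " + k);
        }
        double background = 1.0;
        for (int i = 0; i < k; i++) {
            int idx = BASES.indexOf(w.charAt(i));
            background *= (totalBases == 0) ? 0.25 : baseCounts[idx] / (double) totalBases;
        }
        double observed = (totalKmers == 0)
                ? 0.0 : kmerCounts.getOrDefault(w, 0) / (double) totalKmers;
        double lambda = totalKmers / (totalKmers + pseudoWeight);
        return lambda * observed + (1.0 - lambda) * background;
    }
}
```

Because the mixing weight is the same for every k-mer, the result still sums
to 1 over the whole alphabet, and unseen k-mers get a small non-zero
probability instead of zero. It needs only three single-base frequencies plus
the observed counts, rather than 4^k - 1 independently estimated parameters.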
Mark