[Biojava-l] RE: Bug in HashedAlphabetIndex??

Thu, 8 Mar 2001 11:13:05 +1300

> -----Original Message-----
> From: Matthew Pocock [mailto:mrp@sanger.ac.uk]
> Sent: Thursday, March 08, 2001 5:00 AM
> To: Schreiber, Mark
> Cc: 'biojava-l@biojava.org'
> Subject: Re: [Biojava-l] RE: Bug in HashedAlphabetIndex??
> 
> 
> Hi Mark,
> 
> I've fixed this on the main trunk. Thomas, could you port this to the 
> 1.1 branch?
> 

Great, seems to work now. What was the problem??

> The two issues this brings up are
> 
> a) I think that the SymbolList FiniteAlphabet.symbols() is unnecessary.
> If you want to iterate over an alphabet or find its size, you can just
> use the methods in FiniteAlphabet. If you wish to impose some ordering
> on the FiniteAlphabet, then you can use an AlphabetIndexer object
> obtainable via AlphabetManager.getAlphabetIndex(alpha). I have
> deprecated symbols(), but I think it should remain un-deprecated on
> the release branch.
> 

Ok with me.

> b) The default distribution objects construct a distribution with as
> many parameters as there are symbols in your alphabet, and one fewer
> free parameters (as they must sum to 1). I have a gut feeling that
> building a probability distribution over a very large number of symbols
> (e.g. anything larger than DNA hexamers) is probably silly (although
> I've now tested it for dna^7), as you won't have enough data. This may
> mandate the use of custom Distribution implementations that do clever
> data-smoothing, or that have far fewer parameters (e.g. make dna^7
> using a function of a simple dna^2 matrix).
> 
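For a sense of scale, the counts in (b) can be worked out directly. This is
a minimal sketch in plain Java with no BioJava classes; the class and method
names are made up for illustration only.

```java
public class KmerParameterCount {

    // Number of symbols in the cross-product alphabet dna^k, i.e. 4^k.
    static long symbolCount(int k) {
        long n = 1;
        for (int i = 0; i < k; i++) {
            n *= 4;
        }
        return n;
    }

    // A full distribution has one weight per symbol, and the weights must
    // sum to 1, so one of them is determined by the others.
    static long freeParameters(int k) {
        return symbolCount(k) - 1;
    }

    public static void main(String[] args) {
        for (int k : new int[] {2, 6, 7, 9, 11}) {
            System.out.println("dna^" + k + ": " + symbolCount(k)
                    + " symbols, " + freeParameters(k) + " free parameters");
        }
    }
}
```

Hexamers already give 4096 symbols, dna^7 gives 16384, and dna^11 gives
about 4.2 million, each of which needs enough observations to estimate.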

True. Most gene-finder programs use hexamers. Glimmer uses up to dna^9 but
implements an interpolation method to estimate these from lower-order data,
so this sort of thing would require a custom Distribution. I don't know if
anyone has shown a significant advantage in using anything above hexamers,
so I don't think this will be needed in the near future, unless someone
wants to investigate the effect of the helical periodicity of DNA, for
which dna^10 or dna^11 may be needed. Without a custom Distribution, that
would probably take several large genomes to sample adequately.
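
To make the idea of estimating high-order probabilities from lower-order
data concrete, here is a rough sketch. It is not Glimmer's actual scheme and
not a BioJava Distribution: it is a simple count-weighted back-off written
in plain Java, and the class name, the "pseudo" weight and the lowercase
"acgt" encoding are all assumptions made for the example. Each order's
estimate is pulled towards the estimate from the next lower order, strongly
when the context has been seen rarely and weakly when it has been seen often.

```java
import java.util.HashMap;
import java.util.Map;

public class InterpolatedKmerModel {
    private static final String BASES = "acgt";

    private final int maxOrder;   // longest context length used
    private final double pseudo;  // weight given to the lower-order estimate
    // Maps a context string (length 0..maxOrder) to counts of the next base.
    private final Map<String, int[]> counts = new HashMap<>();

    public InterpolatedKmerModel(int maxOrder, double pseudo) {
        this.maxOrder = maxOrder;
        this.pseudo = pseudo;
    }

    // Count, for every position, which base follows each preceding context.
    public void train(String seq) {
        for (int i = 0; i < seq.length(); i++) {
            int b = BASES.indexOf(seq.charAt(i));
            if (b < 0) {
                continue; // skip symbols other than a, c, g, t
            }
            for (int o = 0; o <= maxOrder && o <= i; o++) {
                String ctx = seq.substring(i - o, i);
                counts.computeIfAbsent(ctx, c -> new int[4])[b]++;
            }
        }
    }

    // P(base | context), built up from order 0 to the longest usable order.
    public double probability(String context, char base) {
        int b = BASES.indexOf(base);
        if (b < 0) {
            throw new IllegalArgumentException("not a DNA base: " + base);
        }
        int usable = Math.min(maxOrder, context.length());
        double p = 0.25; // uniform starting point below order 0
        for (int o = 0; o <= usable; o++) {
            int[] c = counts.get(context.substring(context.length() - o));
            if (c == null) {
                continue; // unseen context: keep the lower-order estimate
            }
            int n = c[0] + c[1] + c[2] + c[3];
            // Blend observed counts with the lower-order estimate.
            p = (c[b] + pseudo * p) / (n + pseudo);
        }
        return p;
    }

    public static void main(String[] args) {
        InterpolatedKmerModel model = new InterpolatedKmerModel(3, 1.0);
        model.train("acgtacgtacgt");
        System.out.println(model.probability("acg", 't'));
    }
}
```

The model still stores a count table per observed context, but a context
that never occurs costs nothing and simply inherits the shorter context's
estimate, which is the property that matters once you go past hexamers.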

Mark