[Biojava-dev] Count.java, Distribution.java and Alignment.java

Matthew Pocock matthew_pocock at yahoo.co.uk
Fri Feb 14 21:01:54 EST 2003


Lachlan Coin wrote:
> I just had a few comments about these interfaces, which would make them
> easier/more efficient for me to use.

Great

> 
> It would be great if both these interfaces enforced a nonZeroSymbols()
> method which returned the set of symbols that have a non-zero count /
> probability respectively.  Particularly if you are working with sparse
> counts over high dimensional cross-product alphabets, it seems pretty
> inefficient to iterate through all the members of a cross-product
> alphabet when only a  small fraction of these have counts.  This also
> relates to storage - it would be good to have a DistributionFactory that
> could create sparse distributions.

Sparsity would be a good thing. Feel free to write an implementation 
that does this and come up with a method name and signature and return 
type for nonZeroSymbols(). We can fold the impl in behind the factory 
interface so that people don't know about it.
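
As a rough illustration of the idea, here is a minimal sketch of a sparse count store that only keeps entries for symbols with a non-zero count. It is not the BioJava Count interface: the class name SparseCount, the type parameter S (standing in for Symbol), and the exact signature of nonZeroSymbols() are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the BioJava Count interface: S stands in for Symbol.
final class SparseCount<S> {
    // Only symbols whose count is non-zero have an entry in this map.
    private final Map<S, Double> counts = new HashMap<>();

    // Symbols without an entry implicitly have a count of zero.
    double getCount(S symbol) {
        Double c = counts.get(symbol);
        return c == null ? 0.0 : c;
    }

    // Add delta to the count, dropping the entry if it returns to exactly zero.
    void increaseCount(S symbol, double delta) {
        double updated = getCount(symbol) + delta;
        if (updated == 0.0) {
            counts.remove(symbol);
        } else {
            counts.put(symbol, updated);
        }
    }

    // Read-only view of the symbols that currently have a non-zero count.
    Set<S> nonZeroSymbols() {
        return Collections.unmodifiableSet(counts.keySet());
    }
}
```

With something like this, a caller iterates over nonZeroSymbols() and never has to enumerate the whole cross-product alphabet, and memory use grows with the number of observed symbols only.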

> 
> Also, this is more minor, but Count uses doubles rather than integers,
> which is certainly more flexible, but would seem to take more memory.  Is
> this flexibility needed - isn't Distribution supposed to be for this?

It is used in the training of HMMs. During the forwards-backwards step 
of parameter estimation, counts are added in proportion to the 
probability that parameters are used. It is actually quite sensitive to 
these numbers being correct - if they are rounded too much either way, 
the models do very strange things (like fitting the data worse and worse 
each iteration).
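
To make the rounding problem concrete, here is a small sketch comparing fractional accumulation with integer accumulation. The class and method names are hypothetical and the posterior values in the usage note are made up for illustration; this is not the BioJava training code.

```java
// Hypothetical illustration of why fractional counts matter during training.
final class FractionalCounts {
    // Accumulate posterior-weighted contributions as doubles.
    static double expectedCount(double[] posteriors) {
        double total = 0.0;
        for (double p : posteriors) {
            total += p;
        }
        return total;
    }

    // The same accumulation if each contribution were rounded to an integer first.
    static long roundedCount(double[] posteriors) {
        long total = 0;
        for (double p : posteriors) {
            total += Math.round(p);
        }
        return total;
    }
}
```

For four observations that each use a parameter with probability 0.25, expectedCount gives 1.0 while roundedCount gives 0, so an integer count would lose that evidence entirely.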

> 
> 
> Finally, in Alignment.java, there are two methods, which use
> inconsistent container classes for the labels of the alignment.
> 
> java.util.List getLabels()
> 
> Alignment subAlignment(java.util.Set labels, Location loc)
> 
> so that to get a subAlignment over all labels, you have to convert a List
> to a Set.

I think it's subAlignment that should take a List. Unfortunately, we can't 
fix this right now as it would change the API while we're trying to get 
the 1.3 release out of the door. Once 1.3 hits the street, feel free to 
change it (leave the old method in with a @deprecated for a while).
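
One possible shape for that change is sketched below. The interfaces are cut-down, hypothetical stand-ins for Alignment and Location, the String label type is an assumption, and the List-based signature is only a proposal. The old Set-based method stays, marked deprecated, and delegates to the new one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in; the real Location interface has its own members.
interface Location {}

// Hypothetical, cut-down stand-in for the Alignment interface.
interface Alignment {
    List<String> getLabels();

    // Proposed List-based signature (an assumption, for after the 1.3 release).
    Alignment subAlignment(List<String> labels, Location loc);

    /** @deprecated use {@link #subAlignment(List, Location)} instead. */
    @Deprecated
    default Alignment subAlignment(Set<String> labels, Location loc) {
        // Copy the Set into a List and delegate to the new method.
        return subAlignment(new ArrayList<>(labels), loc);
    }
}
```

A sub-alignment over all labels then becomes subAlignment(getLabels(), loc) with no conversion, and existing callers that pass a Set keep compiling, with a deprecation warning, until the old method is removed.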

Matthew

> 
> 
> Thanks,
> 
> Lachlan
> 
> -------------------------------------------------------------
> Lachlan Coin
> Wellcome Trust Sanger Institute		Magdalene College
> Cambridge  CB10 1SA			Cambridge CB30AG
> Ph: +44 1223 494 820
> Fax: +44 1223 494 919
> ------------------------------------------------------------


-- 
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk


