[Biojava-dev] Modifications to DistributionTools.java

Schreiber, Mark mark.schreiber at agresearch.co.nz
Wed Feb 26 11:42:09 EST 2003


Ah, I see where you're coming from now; previously I was returning the
weighted entropy, not the actual entropy. My bad :(

Thanks for spotting this.

- Mark


> -----Original Message-----
> From: Lachlan Coin [mailto:lc1 at sanger.ac.uk] 
> Sent: Tuesday, 25 February 2003 11:10 p.m.
> To: Schreiber, Mark
> Cc: biojava-dev at biojava.org
> Subject: RE: [Biojava-dev] Modifications to DistributionTools.java
> 
> 
> Hi Mark,
> 
> I think we both agree on what the outcome of
> bitsOfInformation() should be (i.e. the total entropy of the
> ensemble). My change does not affect that calculation.
> 
> However, we disagree on the definition of the Shannon
> information content of an outcome x.  To carry on the
> discussion of the weighted coin below (1/10 heads, 9/10
> tails): what I am proposing is that, under the shannonEntropy
> method, the information content of the outcome heads should
> be log(10) (approx 3.32 bits) and that of the outcome tails
> should be log(10/9) (approx 0.15 bits).  Of course - as you
> point out - under Shannon's theory a coin can carry at most
> one bit of information, and in this case the entropy of the
> coin is the weighted average of these two numbers:
> 1/10 * 3.32 + 9/10 * 0.15 = approx 0.47 bits.  So, as you
> would expect for a biased coin, the entropy is much less than
> the Shannon maximum of one bit.
> 
> Think of it as tossing this coin 100 times.  Whenever we get
> a head, we gain 3.32 bits of information; whenever we get a
> tail, we only get 0.15 bits.  Because we get a head only once
> in every ten throws and a tail nine times in every ten, the
> average information per throw is, as above,
> 1/10 * 3.32 + 9/10 * 0.15 = approx 0.47 bits.
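> 
> (A minimal sketch, in plain Java, of the two quantities above - the
> per-outcome term log(1/p) and its probability-weighted average, which
> is the entropy of the coin.  The class and variable names are purely
> illustrative; this is not anything in DistributionTools.)
> 
> 	public class BiasedCoinEntropy {
> 	    public static void main(String[] args) {
> 	        double[] p = {0.1, 0.9};  // P(heads), P(tails)
> 	        double entropy = 0.0;
> 	        for (int i = 0; i < p.length; i++) {
> 	            // information content (surprisal) of this outcome, in bits
> 	            double surprisal = Math.log(1.0 / p[i]) / Math.log(2.0);
> 	            System.out.println("log(1/p) = " + surprisal + " bits");
> 	            entropy += p[i] * surprisal;  // weighted contribution
> 	        }
> 	        // prints approx 3.32 and 0.15 above, and approx 0.47 here
> 	        System.out.println("entropy = " + entropy + " bits");
> 	    }
> 	}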
> 
> This also fits more naturally with Shannon's source coding
> theory.  Imagine you want to communicate the results of a set
> of experiments with distribution p(x_i) over some set of
> outcomes x_i for each experiment.  We assume each experiment
> has the same distribution.  To do this, you create a codeword
> for each outcome, which is just a binary string.  However,
> you need to ensure that the receiver can uniquely decode your
> string (we assume you don't include a symbol indicating that
> you have finished the string for one experiment and are
> commencing the string for the next).  One way to do this is
> to create what is called a prefix code: no codeword is the
> prefix of another codeword.  This type of code is
> instantaneously decipherable, and such codes also correspond
> to binary trees.  It turns out that the most efficient choice
> of codewords (in terms of minimising the overall length of
> the communication) has the property that the length of the
> codeword representing x_i (call it l_i) is as close as
> possible to log(1/p_i) (or equal to log(1/p_i) if this is an
> integer).  If you do this, then the total length of the
> communication will be close to n*H(x).  So the term
> log(1/p_i) has a natural interpretation as the best choice of
> codeword length to communicate the results of the experiment.
> In this sense it is a natural choice to assign as the
> "information content" of an outcome - it is the information
> cost of transmitting that outcome using the most efficient
> communication method you can.
> 
> 
> Anyhow, sorry for changing it before you had a chance to 
> voice your disagreement!  Hope the hackathon was fun.
> 
> Regards,
> 
> Lachlan
> 
> 
> On Tue, 25 Feb 2003, Schreiber, Mark wrote:
> 
> > Sorry not to get back to you earlier (been in Singapore for the
> > Hackathon).
> >
> > My understanding is that log(1/p) is log odds, which is not Shannon
> > entropy (Shannon 1948).  I may be wrong on this, but if not, could
> > the method be changed back and a new log odds method be added to do
> > what you are calculating here?
> >
> > Under Shannon's theory a coin can hold at most one bit of information.
> >
> > - Mark
> >
> >
> > > -----Original Message-----
> > > From: Lachlan Coin [mailto:lc1 at sanger.ac.uk]
> > > Sent: Thursday, 13 February 2003 11:17 p.m.
> > > To: biojava-dev at biojava.org
> > > Subject: [Biojava-dev] Modifications to DistributionTools.java
> > >
> > >
> > > Hi,
> > >
> > > I have been using DistributionTools.java and wanted to commit a
> > > few changes.  If no one particularly objects, then I will go ahead
> > > and commit these:
> > >
> > >   - I have added a method:
> > > public Distribution jointDistributionOverAlignment(Alignment a,
> > >                                boolean countGaps,
> > >                                double nullWeight, int[] cols)
> > >     This just returns the joint distribution over several columns
> > > of the alignment.  It is useful for calculating the mutual
> > > information of two columns in an alignment.
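> > >
> > >     (Purely as a hypothetical usage sketch - none of this is
> > > committed code, and whether a single-column call behaves like this
> > > is an assumption - the mutual information of columns i and j of an
> > > alignment a could then be computed as H(i) + H(j) - H(i,j):)
> > >
> > > 	Distribution joint = DistributionTools.jointDistributionOverAlignment(
> > > 	        a, false, 0.0, new int[] {i, j});
> > > 	Distribution di = DistributionTools.jointDistributionOverAlignment(
> > > 	        a, false, 0.0, new int[] {i});
> > > 	Distribution dj = DistributionTools.jointDistributionOverAlignment(
> > > 	        a, false, 0.0, new int[] {j});
> > > 	// mutual information, in bits
> > > 	double mi = DistributionTools.bitsOfInformation(di)
> > > 	          + DistributionTools.bitsOfInformation(dj)
> > > 	          - DistributionTools.bitsOfInformation(joint);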
> > >
> > >   - I have changed
> > > 	public Map shannonEntropy(Distribution observed, double logBase)
> > >
> > >     Currently this creates a symbol->entropy map, and in the map
> > > it puts p*log(1/p).  I think it is more natural to put log(1/p)
> > > in the map, as this is a reflection of the uncertainty of a
> > > particular outcome (the other is the weighted uncertainty).  I.e.
> > > if we have a weighted coin with probability 0.1 of heads, then a
> > > head carries log(10) bits of information.  I have also set things
> > > up so that the Map only has entries for symbols which have
> > > non-zero probability.
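> > >
> > >     (A rough sketch of the proposed behaviour - not the committed
> > > code; checked exceptions are omitted and a finite alphabet is
> > > assumed:)
> > >
> > > 	Map entropyMap = new HashMap();
> > > 	FiniteAlphabet alpha = (FiniteAlphabet) observed.getAlphabet();
> > > 	for (Iterator it = alpha.iterator(); it.hasNext(); ) {
> > > 	    Symbol s = (Symbol) it.next();
> > > 	    double p = observed.getWeight(s);
> > > 	    if (p > 0.0) {
> > > 	        // only symbols with non-zero probability get an entry;
> > > 	        // the value stored is log(1/p) in the requested base
> > > 	        entropyMap.put(s, new Double(Math.log(1.0 / p) / Math.log(logBase)));
> > > 	    }
> > > 	}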
> > >
> > >   - Consequently, I have also changed
> > >
> > > 	public double bitsOfInformation(Distribution observed)
> > >
> > >     as it relied on shannonEntropy to calculate this.  It now
> > > calculates the shannonEntropy map and takes the average of the
> > > values in this map, weighted by the probabilities given by the
> > > observed distribution.
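> > >
> > >     (Correspondingly, a rough sketch of the new bitsOfInformation -
> > > again not the committed code, checked exceptions omitted:)
> > >
> > > 	Map ent = shannonEntropy(observed, 2.0);  // per-symbol log(1/p), base 2
> > > 	double total = 0.0;
> > > 	for (Iterator it = ent.keySet().iterator(); it.hasNext(); ) {
> > > 	    Symbol s = (Symbol) it.next();
> > > 	    // weight each value by the observed probability of its symbol
> > > 	    total += observed.getWeight(s) * ((Double) ent.get(s)).doubleValue();
> > > 	}
> > > 	return total;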
> > >
> > >
> > > I have also added some JUnit tests to test these methods.
> > >
> > > Thanks,
> > >
> > > Lachlan
> > >
> > >
> > >
> > > -------------------------------------------------------------
> > > Lachlan Coin
> > > Wellcome Trust Sanger Institute		Magdalene College
> > > Cambridge  CB10 1SA			Cambridge CB3 0AG
> > > Ph: +44 1223 494 820
> > > Fax: +44 1223 494 919
> > > ------------------------------------------------------------
> > >
> > > _______________________________________________
> > > biojava-dev mailing list
> > > biojava-dev at biojava.org 
> > > http://biojava.org/mailman/listinfo/biojava-dev
> > >
> 
> -------------------------------------------------------------
> Lachlan Coin
> Wellcome Trust Sanger Institute		Magdalene College
> Cambridge  CB10 1SA			Cambridge CB3 0AG
> Ph: +44 1223 494 820
> Fax: +44 1223 494 919
> ------------------------------------------------------------
> 
> 


