[Biojava-dev] Modifications to DistributionTools.java

Lachlan Coin lc1 at sanger.ac.uk
Tue Feb 25 10:10:21 EST 2003


Hi Mark,

I think we both agree on what the outcome of
 bitsOfInformation() should be (i.e. the total entropy of the ensemble).
My change does not affect that calculation.

However, we disagree on the definition of the Shannon information content
of an outcome x.  To carry on the discussion of the weighted coin below
(1/10 heads, 9/10 tails): what I am proposing is that, under the
shannonEntropy method, the entropy of the outcome heads should be log(10)
(approx 3.32 bits, taking logs base 2 throughout) and the entropy of the
outcome tails should be log(10/9) (approx 0.15 bits).  Of course, as you
point out, under Shannon's theory a coin can carry at most one bit of
information, and in this case the entropy of the coin is the weighted
average of these two numbers:  1/10 * 3.32 + 9/10 * 0.15 = approx 0.47
bits.  So, as you would expect for a biased coin, the entropy is much
less than the Shannon maximum of one bit.

Think of it as tossing this coin 100 times.  Whenever we get a head, we
gain 3.32 bits of information; whenever we get a tail, we gain only
0.15 bits.  Because we get a head once in every ten throws and a tail
nine times in every ten, the average information per throw is, as above,
1/10 * 3.32 + 9/10 * 0.15 = approx 0.47 bits.
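As a quick sanity check, here is a small self-contained sketch (plain
Java rather than the BioJava API; the class and method names are mine)
that computes the per-outcome information content log2(1/p) and the
weighted average for this coin:

```java
public class CoinEntropy {
    // Information content of an outcome with probability p, in bits: log2(1/p)
    static double informationContent(double p) {
        return Math.log(1.0 / p) / Math.log(2.0);
    }

    public static void main(String[] args) {
        double pHeads = 0.1, pTails = 0.9;
        double iHeads = informationContent(pHeads); // approx 3.32 bits
        double iTails = informationContent(pTails); // approx 0.15 bits
        // Entropy of the coin: probability-weighted average of the two
        double entropy = pHeads * iHeads + pTails * iTails; // approx 0.47 bits
        System.out.printf("heads: %.2f bits, tails: %.2f bits, entropy: %.2f bits%n",
                iHeads, iTails, entropy);
    }
}
```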

This also fits more naturally with Shannon's source coding theorem.
Imagine you want to communicate the results of a set of n experiments,
each with the same distribution p(x_i) over some set of outcomes x_i.
To do this, you create a codeword for each outcome, which is just a
binary string.  However, you need to ensure that the receiver can
uniquely decode your string (we assume you don't include a symbol
indicating that you have finished the string for one experiment and are
commencing the string for the next).  One way to do this is to create
what is called a prefix code, in which no codeword is the prefix of
another codeword.  This type of code is instantaneously decipherable,
and such codes also correspond to binary trees.  It turns out that the
most efficient choice of codewords (in terms of minimizing the overall
length of the communication) has the property that the length of the
codeword representing x_i (call it l_i) is as close as possible to
log(1/p_i) (or equal to log(1/p_i) if this is an integer).  If you do
this, then the total length of the communication will be close to
n*H(x).  So the term log(1/p_i) has a natural interpretation as the
best choice of codeword length for communicating the results of the
experiment.  In this sense, it is a natural choice for the "information
content" of an outcome, as it is the information cost of transmitting
that outcome using the most efficient communication method you can.
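To make the codeword-length claim concrete, here is a small sketch
(again plain Java; the class name and the example distribution are
mine) using a dyadic distribution, where the Shannon code lengths
ceil(log2(1/p_i)) are exact and the expected codeword length equals the
entropy H:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ShannonCodeLengths {
    // Shannon code length for an outcome of probability p: ceil(log2(1/p)).
    // The small epsilon guards against floating-point round-up when
    // log2(1/p) is an exact integer.
    static int codeLength(double p) {
        return (int) Math.ceil(Math.log(1.0 / p) / Math.log(2.0) - 1e-9);
    }

    public static void main(String[] args) {
        // Hypothetical dyadic distribution over four outcomes
        Map<String, Double> p = new LinkedHashMap<>();
        p.put("a", 0.5);
        p.put("b", 0.25);
        p.put("c", 0.125);
        p.put("d", 0.125);

        double kraftSum = 0, expectedLength = 0, entropy = 0;
        for (double prob : p.values()) {
            int len = codeLength(prob);
            // Kraft inequality: sum of 2^(-l_i) <= 1 guarantees a prefix
            // code with these lengths exists
            kraftSum += Math.pow(2, -len);
            expectedLength += prob * len;
            entropy += prob * Math.log(1.0 / prob) / Math.log(2.0);
        }
        // For this dyadic distribution: Kraft sum 1.0, E[length] = H = 1.75 bits
        System.out.printf("Kraft sum %.2f, E[length] %.2f, H %.2f%n",
                kraftSum, expectedLength, entropy);
    }
}
```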


Anyhow, sorry for changing it before you had a chance to voice your
disagreement!  Hope the hackathon was fun.

Regards,

Lachlan

On Tue, 25 Feb 2003, Schreiber, Mark wrote:

> Sorry to not get bak to you earlier (been in Singapore for the
> Hackathon).
>
> My understanding is that log(1/p) is a log odds score, which is not
> Shannon entropy (Shannon 1948).  I may be wrong on this, but if not,
> could the method be changed back and a new log odds method be added to
> do what you are calculating here?
>
> Under Shannon's theory a coin can only hold one bit of information.
>
> - Mark
>
>
> > -----Original Message-----
> > From: Lachlan Coin [mailto:lc1 at sanger.ac.uk]
> > Sent: Thursday, 13 February 2003 11:17 p.m.
> > To: biojava-dev at biojava.org
> > Subject: [Biojava-dev] Modifications to DistributionTools.java
> >
> >
> > Hi,
> >
> > I have been using DistributionTools.java and wanted to commit
> > a few changes.  If no one particularly objects, then I will go
> > ahead and commit them:
> >
> >   - I have added a method:
> > public Distribution jointDistributionOverAlignment(Alignment a,
> >                                boolean countGaps,
> >                                double nullWeight, int[] cols)
> >     this just returns the joint distribution of several
> > columns in the alignment.  It is useful for calculating
> > mutual information for two columns in an alignment
> >
> >   -  I have changed
> > 	public Map shannonEntropy(Distribution observed, double logBase)
> >
> > 	 Currently this creates a symbol->entropy map, and in
> > the map it puts p*log(1/p).  I think it is more natural to
> > put log(1/p) in the map, as this is a reflection of the
> > uncertainty of a particular outcome (the other is the
> > weighted uncertainty).  I.e. if we have a weighted coin with
> > a 0.1 probability of heads, then a head carries log(10) bits
> > of information.  I have also set things up so that the Map
> > only has entries for symbols which have non-zero probability.
> >
> >    - Consequently, I have also changed
> >
> > 	public double bitsOfInformation(Distribution observed)
> >
> > 	as it relied on shannonEntropy to calculate this.  It
> > now calculates the shannonEntropy map and takes the average
> > of the values in this map, weighted by the probability of
> > each symbol under the observed distribution.
> >
> >
> > I have also added some jUnit tests to test these methods.
> >
> > Thanks,
> >
> > Lachlan
> >
> >
> >
> > -------------------------------------------------------------
> > Lachlan Coin
> > Wellcome Trust Sanger Institute		Magdalene College
> > Cambridge  CB10 1SA			Cambridge CB30AG
> > Ph: +44 1223 494 820
> > Fax: +44 1223 494 919
> > ------------------------------------------------------------
> >
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at biojava.org
> > http://biojava.org/mailman/listinfo/biojava-dev
> >
>

-------------------------------------------------------------
Lachlan Coin
Wellcome Trust Sanger Institute		Magdalene College
Cambridge  CB10 1SA			Cambridge CB30AG
Ph: +44 1223 494 820
Fax: +44 1223 494 919
------------------------------------------------------------


