[Bioperl-l] SiteMatrix changes

Sendu Bala bix at sendu.me.uk
Thu Aug 31 16:47:36 UTC 2006


skirov wrote:
>> What has adding the number 1 to some but not all input numbers got
>> to do with pseudo counts? Can you explain your thinking?
> 
> The code was: if ($self->{_corrected}) { ${$self->{probA}}[$i] +=
> $self->{_correction}; ${$self->{probC}}[$i] += $self->{_correction}; 
> ${$self->{probG}}[$i] += $self->{_correction}; ${$self->{probT}}[$i]
> += $self->{_correction}; } Add 1 (or the user supplied correction
> value) to any position that has 0. Perhaps you are right (if I
> understood correctly) and 1 should be added to everything if any
> position contains 0. I am not really sure abut this.

If you're going to do the correction, you always do it, not just when 
one of the positions contains 0. I imagine you were detecting 0 as an 
indicator that no correction had been done, but its possible no 
correction has been done even if none of the positions has a 0.


>> Why is having 0 a bad idea?
> 
> Here is a wikipedia explanation: "In any observed data set or sample
> there is the possibility, especially with low-probability events
> and/or small data sets, of a possible event not occurring. Its
> observed frequency is therefore 0, implying a probability of 0. This
> is an oversimplification and is often unhelpful, particularly in 
> probability-based machine learning techniques such as artificial
> neural networks and hidden Markov models."

Yes, and it goes on to say:

"The simplest approach is to add 1 to each observed number of events
including the zero-count one. [...] A more complex approach is to
estimate the probability of the events from other factors and adjust
accordingly. Neither approach is completely satisfactory and both are a
bit of a fudge."


>> I don't think the module should be trying to do any kind of
>> analysis, especially given that it has no idea of the source of its
>> input data. It must just accept what it is given. If a user or
>> other module wants to do pseudo-count correction, they can do it 
>> themselves in the most appropriate way for their data.
> 
> You are wrong here- this gives an option to the user since correction
> can be disabled (which should be the case with frequencies.). In most
> cases pseudo counts are necessary and that is why this should be the
> default behavior.

But there are lots of possible algorithms for doing pseudo count
correction, adding a number being the worst. There's no gain to adding
the number. The option only serves to a) hide the problem but not fix
it, and b) give the wrong frequencies. Two dangerous things.

But ok, I'll revert to the previous behaviour and make the docs very 
clear about what is going on. My main problem was your implementation of 
it, which gave really bad results since you didn't apply the correction 
to everything.


>> IUPAC has no concept of frequencies or have a cutoff. When there is
>> a chance of all four bases (complete ambiguity), the IUPAC code is
>> N. If you want it to return 'R' in this case, the IUPAC method
>> would need to be extended to allow input of a user-defined 
>> threshold defining what frequencies to ignore.
> 
> So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and
> T is 0.00004 you would have N???

Yes, because that is correct. There's no way to judge automatically what
frequencies should be ignored, unless we know exactly how and if psuedo
count correction was done. Not guessing and being correct is better than
guessing to give 'nicer' answers but sometimes being completely wrong 
(and always being strictly wrong). Really, if you want a nice IUPAC 
string you must simply chose not to do pseudo count correction.


> Allowing customer supplied thresholds is not a bad idea, you could
> implement it if you wish. But please do not fix something that is not
> broken.

It was giving the wrong answers, so needed fixing. I'll add the 
threshold option.




More information about the Bioperl-l mailing list