[Bioperl-l] SiteMatrix changes

aaron.j.mackey at gsk.com aaron.j.mackey at gsk.com
Thu Aug 31 18:29:56 UTC 2006


It seems to me that some of the tension here stems from an analysis-like 
transformation/derivation that is buried within a data object rather than 
being separately instantiated in an analysis factory object.

It's a little like having the (DNA) sequence object be able to translate 
itself: yes, you want to provide the functionality directly to the data 
object, and you want it to do something sensible without any further 
parameterization, but you also want it to be parameterizable because who 
knows what thresholds/behaviors anyone will ever want.  Thus we have 
Bio::Tools::CodonTable as the analysis factory, and a "convenience" 
method, "translate", available at the sequence object level (which 
delegates to an anonymous factory).  There have been many arguments in the 
past about the default behavior of that convenience method, which goes to 
show that there's not always one right answer (though there can only be 
one default, in the end).
 
Next, this whole "what does IUPAC mean" argument is a bit odd.  Both of 
you are right, and wrong.  IUPAC ambiguity codes are just that, ambiguity 
codes that are useful to refer to particular subsets of residues, without 
respect to relative frequencies.  A position-specific frequency matrix may 
be transformed into a "consensus" code that utilizes IUPAC codes, but only 
after some threshold for inclusion/exclusion is chosen (and this must be a 
user option, as there is no right answer).  The code did have a default 
threshold, which might have been practical, I cannot say.  But I do know 
that arguing whether threshold A is more practical/useful than threshold B 
is like arguing over whose car is shinier.

Hilmar has already nicely reminded us about pseudocounts and their use, so 
I have nothing more to say, but that again I think choices of pseudocount 
"strategies" should include the option of no pseudocounts, and be nicely 
packaged up into some form of analysis factory.

Lastly, it is rather poor form to swoop down on some piece of code and 
fundamentally alter the way it behaves without warning and discussion. 
Presumably you had to change the test suite to get the new behavior to 
pass tests, which is always a sign that some form of discussion needs to 
happen (e.g. perhaps backwards compatibility is an issue).

So, lets play nicely and work out an understandable solution together, 
with a little less bravado and hubris.

-Aaron

bioperl-l-bounces at lists.open-bio.org wrote on 08/31/2006 01:26:48 PM:

> >===== Original Message From Sendu Bala <bix at sendu.me.uk> =====
> >skirov wrote:
> >>> What has adding the number 1 to some but not all input numbers got
> >>> to do with pseudo counts? Can you explain your thinking?
> >>
> >> The code was: if ($self->{_corrected}) { ${$self->{probA}}[$i] +=
> >> $self->{_correction}; ${$self->{probC}}[$i] += $self->{_correction};
> >> ${$self->{probG}}[$i] += $self->{_correction}; ${$self->{probT}}[$i]
> >> += $self->{_correction}; } Add 1 (or the user supplied correction
> >> value) to any position that has 0. Perhaps you are right (if I
> >> understood correctly) and 1 should be added to everything if any
> >> position contains 0. I am not really sure abut this.
> >
> >If you're going to do the correction, you always do it, not just when
> >one of the positions contains 0. I imagine you were detecting 0 as an
> >indicator that no correction had been done, but its possible no
> >correction has been done even if none of the positions has a 0.
> 
> Well, if none of the positions is 0, no correction is really necessary.
> 
> >
> >>> Why is having 0 a bad idea?
> >>
> >> Here is a wikipedia explanation: "In any observed data set or sample
> >> there is the possibility, especially with low-probability events
> >> and/or small data sets, of a possible event not occurring. Its
> >> observed frequency is therefore 0, implying a probability of 0. This
> >> is an oversimplification and is often unhelpful, particularly in
> >> probability-based machine learning techniques such as artificial
> >> neural networks and hidden Markov models."
> >
> >Yes, and it goes on to say:
> >
> >"The simplest approach is to add 1 to each observed number of events
> >including the zero-count one. [...] A more complex approach is to
> >estimate the probability of the events from other factors and adjust
> >accordingly. Neither approach is completely satisfactory and both are a
> >bit of a fudge."
> >
> >
> >>> I don't think the module should be trying to do any kind of
> >>> analysis, especially given that it has no idea of the source of its
> >>> input data. It must just accept what it is given. If a user or
> >>> other module wants to do pseudo-count correction, they can do it
> >>> themselves in the most appropriate way for their data.
> >>
> >> You are wrong here- this gives an option to the user since correction
> >> can be disabled (which should be the case with frequencies.). In most
> >> cases pseudo counts are necessary and that is why this should be the
> >> default behavior.
> >
> >But there are lots of possible algorithms for doing pseudo count
> >correction, adding a number being the worst. There's no gain to adding
> >the number. The option only serves to a) hide the problem but not fix
> >it, and b) give the wrong frequencies. Two dangerous things.
> >
> >But ok, I'll revert to the previous behaviour and make the docs very
> >clear about what is going on. My main problem was your implementation 
of
> >it, which gave really bad results since you didn't apply the correction
> >to everything.
> Bad results meaning? There was an error in the constructor method, 
because 
> corrections were used even if the input is a frequency. Even in thiscase 
you 
> can say -correction=>0.0000001 and be fine. So the real problem here is 
lack 
> of documentation. I wish you would have asked me before changing the 
code, 
> then most of this email exchange would have been unnecessary.
> 
> >
> >
> >>> IUPAC has no concept of frequencies or have a cutoff. When there is
> >>> a chance of all four bases (complete ambiguity), the IUPAC code is
> >>> N. If you want it to return 'R' in this case, the IUPAC method
> >>> would need to be extended to allow input of a user-defined
> >>> threshold defining what frequencies to ignore.
> >>
> >> So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and
> >> T is 0.00004 you would have N???
> >
> >Yes, because that is correct.
> 
> No it is not- IUPAC is an approximation of a PFM, which is very 
> different from 
> what you are doing. It is flat out wrong and IUPAC consensus has 
> nothing to do 
> with the correction.
> 
> There's no way to judge automatically what
> >frequencies should be ignored, unless we know exactly how and if psuedo
> >count correction was done. Not guessing and being correct is better 
than
> >guessing to give 'nicer' answers but sometimes being completely wrong
> >(and always being strictly wrong). Really, if you want a nice IUPAC
> >string you must simply chose not to do pseudo count correction.
> Sure, when you create the object add -correction=>0.
> >
> >
> >> Allowing customer supplied thresholds is not a bad idea, you could
> >> implement it if you wish. But please do not fix something that is not
> >> broken.
> >
> >It was giving the wrong answers, so needed fixing. I'll add the
> >threshold option.
> 
> No it was not giving wrong answers, the error was not in this function, 
so it 
> did not need fixing. Please add the default thresholds or leave it 
unchanged 
> and I will do it.
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 





More information about the Bioperl-l mailing list