[Bioperl-l] SiteMatrix changes

Sendu Bala bix at sendu.me.uk
Thu Aug 31 18:50:28 UTC 2006


skirov wrote:
>
>>>> IUPAC has no concept of frequencies or have a cutoff. When there is
>>>> a chance of all four bases (complete ambiguity), the IUPAC code is
>>>> N. If you want it to return 'R' in this case, the IUPAC method
>>>> would need to be extended to allow input of a user-defined
>>>> threshold defining what frequencies to ignore.
 >
>>> So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and
>>> T is 0.00004 you would have N???
 >
>> Yes, because that is correct.
> 
> No it is not- IUPAC is an approximation of a PFM, which is very different from 
> what you are doing.

The IUPAC-IUB nomenclature for incompletely specified bases in nucleic 
acid sequences is just that and no more. They were not trying to 
'approximate a PFM'. It is a nomenclature for describing uncertainty.

http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

If you tell me that you are uncertain what the nucleic acid is by giving 
me a non-zero probability of having all nucleic acids, the appropriate 
IUPAC symbol is 'N'.


>> There's no way to judge automatically what
>> frequencies should be ignored, unless we know exactly how and if psuedo
>> count correction was done. Not guessing and being correct is better than
>> guessing to give 'nicer' answers but sometimes being completely wrong
>> (and always being strictly wrong). Really, if you want a nice IUPAC
>> string you must simply chose not to do pseudo count correction.
 >
> Sure, when you create the object add -correction=>0.

Right, so you're happy with getting IUPAC strings of all 'N' unless 
-correction => 0 has been used? When a -correction is requested, the 
only way to get non Ns should be to supply a threshold. There could be a 
default threshold, but what would it be?


>>> Allowing customer supplied thresholds is not a bad idea, you could
>>> implement it if you wish. But please do not fix something that is not
>>> broken.
 >>
>> It was giving the wrong answers, so needed fixing. I'll add the
>> threshold option.
> 
> No it was not giving wrong answers, the error was not in this function, so it 
> did not need fixing.

If you revert just _calculate_consensus() and _to_IUPAC() to the 
previous implementation, test 34 on line 144 of t/psm.t gives the answer:

VVDCAGGTGBYD

having parsed t/data/transfac.dat

That file has the matrix:
P0      A      C      G      T
01      1      2      2      0
02      2      1      2      0
03      3      0      1      1
04      0      5      0      0
05      5      0      0      0
06      0      0      4      1
07      0      1      4      0
08      0      0      0      5
09      0      0      5      0
10      0      1      2      2
11      0      2      0      3
12      1      0      3      1

The correct answer here is:

VVDCAKSTGBYD

To call positions 6 and 7 Gs is wrong. This is the problem with having a 
default threshold. It is dangerous in sometimes giving the wrong answer 
and the user would never realize.


If we revert all my changes the answer given is:

NNNNNNNNNNNN

If you think you can fix this bug another way (that isn't specific to 
Bio::Matrix::PSM::IO::transfac), I'll revert the changes and create a 
bug report for you.



More information about the Bioperl-l mailing list