[Biopython-dev] [Biopython] Bio.motifs raising Exceptions using pypy

Peter Cock p.j.a.cock at googlemail.com
Fri Jul 12 10:48:10 UTC 2013


On Fri, Jul 12, 2013 at 11:00 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Fri, Jul 12, 2013 at 10:40 AM, Marco Galardini
> <marco.galardini at unifi.it> wrote:
>> Hi,
>>
>> I've put together a sample script and sample data to replicate the issue:
>>
>> python test.py test.fa test.txt
>> 551 20.9172
>> -5389 21.0426
>>
>> pypy test.py test.fa test.txt
>> 551 20.9172
>> -5389 21.0426
>>
>> Traceback (most recent call last):
>>   File "app_main.py", line 72, in run_toplevel
>>   File "test.py", line 20, in <module>
>>     for position, score in pssm.search(s.seq, threshold=score_t):
>>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 354, in search
>>     score = self.calculate(s)
>>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 331, in calculate
>>     score += self[letter][position]
>>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 113, in __getitem__
>>     return dict.__getitem__(self, letter)
>> KeyError: 'N'
>>
>> Hope this helps. My guess is that it may be related to the
>> implementation of dictionaries in PyPy, since the object raising the
>> exception inherits from dict.
>>
>> Thanks a lot for the help,
>> Marco
>
> Great - I can reproduce that here using PyPy 1.9 as well...
>

OK - this also breaks under Jython, and even under C Python if we
disable the C extension. Here self (the matrix) only has the keys
A, C, G and T, not N, hence the KeyError. The C code just ignores
such letters. There is also an inconsistency with mixed case.
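
A minimal sketch of the failure mode, not the Biopython code itself:
TinyMatrix, its made-up values and this calculate function are
hypothetical stand-ins for a dict-based matrix keyed by A, C, G and T.
It is written for Python 3, and it should behave the same on any
interpreter, which is why the pure Python path fails everywhere once
the C extension is out of the picture.

```python
# Hypothetical stand-in for the matrix class: a dict subclass keyed by letter.
class TinyMatrix(dict):
    def __getitem__(self, letter):
        return dict.__getitem__(self, letter)


# Made-up log-odds values for a motif of length 2 (illustration only).
matrix = TinyMatrix({
    "A": [1.0, -1.0],
    "C": [-1.0, 1.0],
    "G": [-0.5, 0.5],
    "T": [0.5, -0.5],
})


def calculate(matrix, sequence):
    """Sum the per-position scores, as a pure Python loop would."""
    score = 0.0
    for position, letter in enumerate(sequence):
        score += matrix[letter][position]
    return score


print(calculate(matrix, "AC"))  # both letters are keys in the matrix

try:
    calculate(matrix, "AN")  # N is not a key
except KeyError as err:
    print("KeyError:", err)
```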

New unit test:
https://github.com/biopython/biopython/commit/e13c97ae3535b58d8ec3da3fc565e97db1fa75a3

Fix for the mixed case difference:
https://github.com/biopython/biopython/commit/0cab00c66a1fd15072d020cfc17edbdfb37484a5

The KeyError from bad characters can be handled like this:

$ git diff
diff --git a/Bio/motifs/matrix.py b/Bio/motifs/matrix.py
index bce1d4f..e6446b5 100644
--- a/Bio/motifs/matrix.py
+++ b/Bio/motifs/matrix.py
@@ -364,7 +364,11 @@ class PositionSpecificScoringMatrix(GenericPositionMatrix):
                 score = 0.0
                 for position in xrange(m):
                     letter = sequence[i+position]
-                    score += self[letter][position]
+                    try:
+                        score += self[letter][position]
+                    except KeyError:
+                        #The C code ignores unexpected letters like N
+                        pass
                 scores.append(score)
         else:
             # get the log-odds matrix into a proper shape

However, that leaves a numerical difference in the output:

$ pypy test_motifs.py
test_simple (__main__.MotifTestPWM)
Test if Bio.motifs PWM scoring works. ... ok
test_with_bad_char (__main__.MotifTestPWM)
Test if Bio.motifs PWM scoring works with unexpected letters like N. ... FAIL
test_with_mixed_case (__main__.MotifTestPWM)
Test if Bio.motifs PWM scoring works with mixed case. ... ok
...
======================================================================
FAIL: test_with_bad_char (__main__.MotifTestPWM)
Test if Bio.motifs PWM scoring works with unexpected letters like N.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_motifs.py", line 1662, in test_with_bad_char
    self.assertTrue(_isnan(result[6]), "Expected nan, not %r" % result[6])
AssertionError: Expected nan, not -37.417418833750574

----------------------------------------------------------------------
Ran 15 tests in 0.077s

FAILED (failures=1)

The same test failure occurs on Jython, and on C Python if I
disable the C extension. This needs a little more investigation... I
don't immediately follow when the C code sets the value
to nan.
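
To make the difference concrete, here is a rough sketch of two possible
policies for the pure Python fallback. The first mirrors the diff above
and skips unexpected letters. The second is only an assumption about
what the test expects: it treats any window containing an unexpected
letter as nan. Whether the C code really does that is exactly the open
question. LOG_ODDS, score_skipping and score_nan are hypothetical names
with made-up values (Python 3).

```python
# Made-up log-odds values for a motif of length 2 (illustration only).
LOG_ODDS = {
    "A": [1.0, -1.0],
    "C": [-1.0, 1.0],
    "G": [-0.5, 0.5],
    "T": [0.5, -0.5],
}


def score_skipping(window):
    """Policy in the diff above: ignore letters that are not in the matrix."""
    score = 0.0
    for position, letter in enumerate(window):
        try:
            score += LOG_ODDS[letter][position]
        except KeyError:
            pass  # unexpected letter contributes nothing
    return score


def score_nan(window):
    """Assumed alternative: an unexpected letter makes the whole score nan."""
    score = 0.0
    for position, letter in enumerate(window):
        try:
            score += LOG_ODDS[letter][position]
        except KeyError:
            return float("nan")
    return score


print(score_skipping("AN"))  # finite: only the A contributes
print(score_nan("AN"))       # nan
```

With the skipping policy a window containing N still gets a finite
score, which would be consistent with the -37.4... value in the failing
test rather than the nan it expects.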

Peter


