[Biopython-dev] Pickle problem on 64 bit Windows with Python 3.4

Peter Cock p.j.a.cock at googlemail.com
Tue Apr 22 08:36:23 EDT 2014


On Tue, Apr 22, 2014 at 12:09 PM, Manlio Calvi <manlio.calvi at gmail.com> wrote:
> On Tue, Apr 22, 2014 at 12:44 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> On Mon, Apr 21, 2014 at 6:45 PM, Manlio Calvi <manlio.calvi at gmail.com> wrote:
>>> From what I read here http://hg.python.org/cpython/rev/4a6b8f86b081 this could be
>>> a problem related to that file. It seems to me they changed the check for a
>>> quote that must be there, and looking at the pickle it apparently isn't.
>>>
>>
>> OK, now things are more confusing - this seems to be working on
>> a colleague's machine, so it may be something different on your
>> setup. Are you using a self compiled Python 3.4?
>>
>> We installed the 64 bit version Python 3.4 on Windows 7 using the
>> binary installer from the website (Windows x86-64 MSI installer),
>> selecting for all users (which probably requires admin rights):
>> https://www.python.org/ftp/python/3.4.0/python-3.4.0.amd64.msi
>
> Exactly as I did, I installed the dependencies (numpy and the like)
> for Biopython using Gohlke's ones.
>
>> We manually downloaded the pickle file via the raw link on GitHub,
>> and tried the test code (as shown below), and it worked perfectly.
>
> I've used the standard "git pull" command from the repository.
> Moreover, I'm coming from a recent format and reinstall of Windows on
> this machine.
> I'm a bit lost here...

OK, I had an idea over lunch which turned out to solve this :)

First I checked that my pickle file on Linux uses Unix newlines:

$ hexdump -C acc_rep_mat.pik | head
00000000  28 64 70 31 0a 28 53 27  4c 27 0a 53 27 52 27 0a  |(dp1.(S'L'.S'R'.|
00000010  74 49 31 30 39 0a 73 28  53 27 49 27 0a 53 27 49  |tI109.s(S'I'.S'I|
00000020  27 0a 74 49 31 34 35 0a  73 28 53 27 51 27 0a 53  |'.tI145.s(S'Q'.S|
00000030  27 51 27 0a 74 49 34 32  0a 73 28 53 27 53 27 0a  |'Q'.tI42.s(S'S'.|
00000040  53 27 54 27 0a 74 49 31  37 32 0a 73 28 53 27 48  |S'T'.tI172.s(S'H|
00000050  27 0a 53 27 54 27 0a 74  49 36 39 0a 73 28 53 27  |'.S'T'.tI69.s(S'|
00000060  51 27 0a 53 27 59 27 0a  74 49 34 31 0a 73 28 53  |Q'.S'Y'.tI41.s(S|
00000070  27 48 27 0a 53 27 50 27  0a 74 49 32 33 0a 73 28  |'H'.S'P'.tI23.s(|
00000080  53 27 4e 27 0a 53 27 59  27 0a 74 49 37 35 0a 73  |S'N'.S'Y'.tI75.s|
00000090  28 53 27 48 27 0a 53 27  4c 27 0a 74 49 37 30 0a  |(S'H'.S'L'.tI70.|

Then I converted it to DOS/Windows newlines (e.g. unix2dos
is easy if you have that, or a few lines of Python if not - see below):

$ hexdump -C acc_rep_mat.dos.pik | head
00000000  28 64 70 31 0d 0a 28 53  27 4c 27 0d 0a 53 27 52  |(dp1..(S'L'..S'R|
00000010  27 0d 0a 74 49 31 30 39  0d 0a 73 28 53 27 49 27  |'..tI109..s(S'I'|
00000020  0d 0a 53 27 49 27 0d 0a  74 49 31 34 35 0d 0a 73  |..S'I'..tI145..s|
00000030  28 53 27 51 27 0d 0a 53  27 51 27 0d 0a 74 49 34  |(S'Q'..S'Q'..tI4|
00000040  32 0d 0a 73 28 53 27 53  27 0d 0a 53 27 54 27 0d  |2..s(S'S'..S'T'.|
00000050  0a 74 49 31 37 32 0d 0a  73 28 53 27 48 27 0d 0a  |.tI172..s(S'H'..|
00000060  53 27 54 27 0d 0a 74 49  36 39 0d 0a 73 28 53 27  |S'T'..tI69..s(S'|
00000070  51 27 0d 0a 53 27 59 27  0d 0a 74 49 34 31 0d 0a  |Q'..S'Y'..tI41..|
00000080  73 28 53 27 48 27 0d 0a  53 27 50 27 0d 0a 74 49  |s(S'H'..S'P'..tI|
00000090  32 33 0d 0a 73 28 53 27  4e 27 0d 0a 53 27 59 27  |23..s(S'N'..S'Y'|

This increases the file size from 3658 bytes to 4289 bytes, i.e. one
extra carriage return byte for each of the 631 newlines.
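As a rough illustration of where the extra bytes come from, here is a minimal sketch of the same conversion done in memory. The helper name to_dos_newlines is made up for this example, and the sample bytes are just the start of the hexdump above, so this assumes input that contains only Unix newlines:

```python
def to_dos_newlines(data: bytes) -> bytes:
    """Replace every LF with CRLF (assumes the input has no CRLF already)."""
    return data.replace(b"\n", b"\r\n")

# Opening bytes of the pickle, copied from the first hexdump above.
sample = b"(dp1\n(S'L'\nS'R'\ntI109\ns"
converted = to_dos_newlines(sample)

# Each newline gains exactly one carriage return byte.
extra = len(converted) - len(sample)
print(extra, sample.count(b"\n"))

# The same arithmetic for the full file sizes quoted above.
print(4289 - 3658)
```

The growth in size always equals the number of newlines in the input, which is why the two file sizes differ by 631 bytes.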

$ python3.4 -c "import pickle; h=open('acc_rep_mat.pik', 'rb'); m=pickle.load(h); h.close(); print(m)"
{('E', 'M'): 33, ...,  ('D', 'V'): 95}

$ python3.4 -c "import pickle; h=open('acc_rep_mat.dos.pik', 'rb'); m=pickle.load(h); h.close(); print(m)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
_pickle.UnpicklingError: the STRING opcode argument must be quoted

So I can get the exact same error under Linux now :)
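The same failure can probably be reproduced without the file at all. The following is only a sketch: the pickle bytes are written by hand from the start of the first hexdump, with a "." (STOP) added to make a complete pickle, and try_load is a hypothetical helper. It assumes Python 3.4 or later, where the STRING opcode check is strict:

```python
import pickle

# Protocol 0 pickle built by hand from the opening bytes of the hexdump,
# closed with "." (STOP). It encodes the dictionary {('L', 'R'): 109}.
unix_pickle = b"(dp1\n(S'L'\nS'R'\ntI109\ns."
dos_pickle = unix_pickle.replace(b"\n", b"\r\n")

def try_load(data):
    """Return the unpickled object, or the exception if loading fails."""
    try:
        return pickle.loads(data)
    except pickle.UnpicklingError as err:
        return err

unix_result = try_load(unix_pickle)
dos_result = try_load(dos_pickle)
print(repr(unix_result))
print(repr(dos_result))
```

With DOS newlines the argument of the S opcode ends in a carriage return rather than a closing quote, which would explain the "must be quoted" message.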

I confirmed this on Windows, where my copy of git is set up to use
Unix newlines by default (I think), and the file has Unix newlines
(and is 3658 bytes).

C:\repositories\biopython\Tests\SubsMat>c:\python34\python
Python 3.4.0 (v3.4.0:04f714765c13, Mar 16 2014, 19:24:06) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> data = open("acc_rep_mat.pik", "rb").read()
>>> with open("acc_rep_mat.dos.pik", "wb") as h: h.write(data.replace(b"\n", b"\r\n"))
...
4289
>>> quit()

C:\repositories\biopython\Tests\SubsMat>c:\python34\python -c "import pickle; h=open('acc_rep_mat.pik', 'rb'); m=pickle.load(h); h.close(); print(m)"
{('D', 'R'): 115, ..., ('H', 'Q'): 44}

C:\repositories\biopython\Tests\SubsMat>c:\python34\python -c "import pickle; h=open('acc_rep_mat.dos.pik', 'rb'); m=pickle.load(h); h.close(); print(m)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
_pickle.UnpicklingError: the STRING opcode argument must be quoted

So, the upshot is that this git setting change should fix it:
https://github.com/biopython/biopython/commit/b7cc2fe199d22f794612d68e5554361413468372

Could you update your copy of the Biopython source code via git,
and see if that solves this pickle?
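To check whether a local copy was affected, something along these lines might help. It is a sketch only: has_dos_newlines is a hypothetical helper, the demonstration uses two temporary files rather than the real test data, and it assumes a text (protocol 0) pickle, since a binary pickle could legitimately contain the CRLF byte pair:

```python
import os
import tempfile

def has_dos_newlines(path):
    """Return True if the file contains at least one CRLF byte pair."""
    with open(path, "rb") as handle:  # binary mode, so no newline translation
        return b"\r\n" in handle.read()

# Demonstration on two temporary files standing in for the pickle.
with tempfile.TemporaryDirectory() as tmp:
    unix_path = os.path.join(tmp, "unix.pik")
    dos_path = os.path.join(tmp, "dos.pik")
    with open(unix_path, "wb") as handle:
        handle.write(b"(dp1\n(S'L'\n")
    with open(dos_path, "wb") as handle:
        handle.write(b"(dp1\r\n(S'L'\r\n")
    unix_result = has_dos_newlines(unix_path)
    dos_result = has_dos_newlines(dos_path)

print(unix_result, dos_result)
```

Calling has_dos_newlines("acc_rep_mat.pik") in Tests/SubsMat before and after updating should show whether git converted the newlines on checkout.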

Thank you,

Peter

