[Biopython] Reading large files, Biopython cookbook example

Mon Jul 15 17:37:19 UTC 2013

On Jul 14, 2013, at 12:42 PM, Nick Lindberg <nlindberg at mkei.org> wrote:

If your hex character is in a variable "residue" then:

decimal_conversion = int(residue, 16)

will turn A000 into 10000, A001 into 10001, etc.

Actually, int("A000",16) returns 40960, because it's treating the entire string as a hexadecimal number.  Since it seems to be only the first digit that is altered because of the overflow, it may be better to do a string substitution with a regular expression.  Based on the accepted answer at http://stackoverflow.com/questions/937697/, the following lines will replace any alpha character with its value from a dict object. (Just add more items to the dict to cover the overflow residue range.)

###
import re

# the residue number
r = "A000"

# the replacement dict
d = {'A' : '10',
     'B' : '11',
     'C' : '12'} # and so forth

# match uppercase alpha characters
x = re.compile('[A-Z]')

# substitute each matched letter with its dict value; prints 10000
print(x.sub(lambda m: d[m.group()], r))
###
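
To cover the whole overflow range without typing out the dict by hand, the mapping could also be generated. This is only a sketch: it assumes every leading letter A-Z stands for 10-35 in the same way as the three entries above, which has not been confirmed for this file, and the names LETTER_VALUES and expand_residue_number are made up for illustration.

```python
import re
import string

# Assumption: 'A' -> '10', 'B' -> '11', ... 'Z' -> '35' (hypothetical names)
LETTER_VALUES = {letter: str(value)
                 for value, letter in enumerate(string.ascii_uppercase, start=10)}

# Only a letter in the first position is treated as an overflow marker
LEADING_LETTER = re.compile(r'^[A-Z]')


def expand_residue_number(text):
    """Return the residue number as an int, expanding a leading letter."""
    text = text.strip()
    return int(LEADING_LETTER.sub(lambda m: LETTER_VALUES[m.group()], text))


print(expand_residue_number("A000"))  # 10000
print(expand_residue_number("9999"))  # 9999
print(int("A000", 16))                # 40960, the hexadecimal reading
```

Anchoring the pattern at the start of the string means a letter anywhere else raises a ValueError from int() rather than being converted silently.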

I hope that's helpful.

Cheers,
Jared

--
Jared Sampson
Xiangpeng Kong Lab
NYU Langone Medical Center
Old Public Health Building, Room 610
341 East 25th Street
New York, NY 10016
212-263-7898
http://kong.med.nyu.edu/

In your case, since you
know it doesn't go to hex until after 9999 (so an overflow value will start
with a letter), you could check whether the first character is a letter
and, if so, convert it.

From there, you could either subtract 10000 to have it wrap properly, or
fix Biopython to read the correct values.  (You could either do this on
the fly in Biopython, or write a script to convert your residue file.)
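
A conversion script along those lines might look like the sketch below. It rests on several assumptions: the residue number sits in columns 23-26 of ATOM/HETATM records (the standard PDB layout), a leading letter replaces a leading "10", "11", and so on, and the remaining three characters are decimal digits. The function names are hypothetical, and the modulo is a generalisation of "subtract 10000" so that letters beyond A also wrap.

```python
def wrap_residue_number(field):
    """Convert a 4-character residue field such as 'A000' to a wrapped int."""
    field = field.strip()
    if field and field[0].isalpha():
        # Assumption: 'A' replaces a leading "10", 'B' a leading "11", etc.
        leading = ord(field[0].upper()) - ord('A') + 10
        number = leading * 1000 + int(field[1:])
        # Wrap so the result fits in four columns again (A000 -> 0)
        return number % 10000
    return int(field)


def fix_pdb_line(line):
    """Rewrite the residue number columns of an ATOM/HETATM record."""
    if line.startswith(("ATOM", "HETATM")) and len(line) >= 26:
        return line[:22] + "%4d" % wrap_residue_number(line[22:26]) + line[26:]
    # Other record types are passed through unchanged in this sketch
    return line
```

Applying fix_pdb_line to every line of the input and writing the results to a new file would give numbering that wraps from 9999 back to 0. Other records that carry residue numbers (TER, ANISOU) are not handled here.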

Let me know if you'd like some help.

Thanks--

Nick Lindberg
Sr. Consulting Engineer, HPC
Milwaukee Institute
414.727.6413 (W)
http://www.mkei.org

On 7/14/13 6:21 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
Hi everyone,

I'm trying to do something that seems like it ought to be super simple,
since it is on the Biopython wiki cookbook
(http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
that script will not work for me.

When I try to run it as it is, on a PDB file that has more than 10000
residues, I get "NameError: global name 'Residue' is not defined" at
line 77. My assumption was that maybe the script needed to import some other
module from Biopython, so I added "from Bio.PDB import *" to the top of the
script, but then it failed with "TypeError: 'str' object is not callable" at
line 73 (residue = Residue(res_id, resname, self.segid)). I tried to
circumvent this by just changing the name of the variable being created,
from residue = Residue to foobar = Residue (and then carrying that naming
through), but I continued to get the TypeError. Has anyone seen this before,
and/or can anyone help me get this to run?

I have a file where all of the residues after 9999 are numbered starting
with A000, and that causes the normal Bio.PDB.PDBParser to crash with
"invalid literal for int() with base 10: 'A000'", so if there is an easier
workaround for that, that would also be a solution.

Thank you so much for your help!

It seems that the wiki example assumes the residue numbers
wrap around at 9999 and restart at 0, 1, 2, ..., whereas your file
goes from 9999 to A000, A001, etc., which I've not seen before.

Where did your PDB file come from? A public database?
Another tool?

Peter
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython
