[Biopython] Reading large files, Biopython cookbook example

Andrew Dalke dalke at dalkescientific.com
Mon Aug 5 23:09:05 UTC 2013


A bit late, but a bit of background:

> On Sun, Jul 14, 2013 at 5:40 PM, Katrina Lexa <klexa at umich.edu> wrote:
>> My PDB file came from Maestro, so that is the ordering it follows after 9999.

On Jul 15, 2013, at 7:46 PM, Peter Cock wrote:
> i.e. This software package? http://www.schrodinger.com/productpage/14/12/
> 
> Could you contact their support to find out why they are doing this please?

Yes, that's the Maestro Katrina was almost certainly talking about. It's a commercial package which has been around for a while; the company started in 1990 as a commercialization of the Jaguar QM package from Richard Friesner's and William Goddard's labs at CalTech. Maestro is the GUI to their QM and MM codes. 

Their conversion routines support various options. See:
  https://www.schrodinger.com//AcrobatFile.php?type=supportdocs&type2=&ident=530

The key ones are:

  -hex : Use hexadecimal encoding for atom numbers greater
    than 99999 and for residue numbers greater than 9999

and

  -hybrid36 : Use the hybrid36 scheme for atom serial numbers.
    On input, integers of up to 6 digits and hexadecimal numbers are
    recognized on ATOM records by default. On output, the default is
    to use integers for less than 100 000 atoms, and hexadecimal for
    100 000 atoms or more


Annoyingly, as Robert Hanson reported in:
  http://www.mailinglistarchive.com/html/jmol-users@lists.sourceforge.net/2013-01/msg00111.html
(and see the thread at)
  http://article.gmane.org/gmane.science.chemistry.blue-obelisk/1659/match=pdb+ok+who%27s+wise+guy

their default output generates records like:

ATOM  99998  H1  TIP3W3304     -28.543  60.673  40.064  1.00  0.00      WT5  H
ATOM  99999  H2  TIP3W3304     -27.773  60.376  41.353  1.00  0.00      WT5  H
ATOM  186a0  OH2 TIP3W3305     -24.713  61.533  47.372  1.00  0.00      WT5  O
ATOM  186a1  H1  TIP3W3305     -25.652  61.772  47.519  1.00  0.00      WT5  H
ATOM  186a2  H2  TIP3W3305     -24.713  61.625  46.379  1.00  0.00      WT5  H

which means there can be two atoms with the serial number "18700" (or "99999", etc.) in the same file, one written in decimal and one in hexadecimal, so the same string refers to two different atoms.

This obviously messes up all of the other PDB annotations which use a serial id, but I presume that most Maestro users only use PDB files for coordinate data, and not for the other fields.
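
For anyone who has to read such a file, here is a minimal Python sketch of one way to untangle the serial field. It is not Maestro's or Biopython's code. The function name decode_serials is made up, and it assumes (no specification says so) that the serials appear in increasing order, are decimal up to 99999, and are hexadecimal from then on.

```python
def decode_serials(fields):
    """Decode atom serial fields that switch from decimal to hexadecimal.

    Assumption, not taken from any specification: serials are written in
    increasing order, in decimal up to 99999 and in hexadecimal after that.
    """
    decoded = []
    in_hex = False
    for field in fields:
        text = field.strip()
        # Switch to hexadecimal once the decimal range is used up, or as
        # soon as a field contains a character that is not a decimal digit.
        if not in_hex:
            if not text.isdigit() or (decoded and decoded[-1] >= 99999):
                in_hex = True
        decoded.append(int(text, 16 if in_hex else 10))
    return decoded


# The serial field is columns 7-11 of an ATOM record, i.e. line[6:11].
line = "ATOM  186a0  OH2 TIP3W3305     -24.713  61.533  47.372  1.00  0.00      WT5  O"
print(decode_serials(["99998", "99999", line[6:11]]))
```

Because the decoder keeps state, the string "18700" decodes to 18700 when it appears early in the file and to 100096 when it appears after the switch to hexadecimal. A file whose serials are not in increasing order would defeat this heuristic.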

Maestro is the only program I know of which uses this awful form. Enabling the "-hybrid36" option (first-digit-is-in-base-36) by default would make it more consistent with what tools in the X-PLOR/VMD heritage do, where A0000 follows 99999. Presumably they want the full 1,048,575 (hex fffff) atom range that five hexadecimal digits allow.
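
For comparison, here is a rough Python sketch of decoding a hybrid-36 style field in which A0000 follows 99999. It assumes the scheme as I understand it from the cctbx hybrid-36 description: a field starting with a digit is plain decimal, a field starting with an upper-case letter is base 36 continuing from 100000, and a field starting with a lower-case letter continues after ZZZZZ. The name hy36_decode is only illustrative, and I have not checked this against what Maestro's -hybrid36 option actually writes.

```python
def hy36_decode(field, width=5):
    """Decode one hybrid-36 style field of the given width (sketch only).

    Assumed scheme: decimal while the first character is a digit, then
    base 36 with upper-case letters, then base 36 with lower-case letters.
    Mixed-case fields are not rejected here, which a real decoder should do.
    """
    text = field.strip()
    if not text:
        raise ValueError("empty field")
    first = text[0]
    if first.isdigit() or first == "-":
        return int(text)
    if len(text) != width:
        raise ValueError("base-36 field must fill the whole width")
    decimal_count = 10 ** width      # how many values decimal covers
    block = 36 ** (width - 1)        # values covered per leading character
    if first.isupper():
        # 'A' has base-36 value 10, so A0000 must map to 100000.
        return int(text, 36) - 10 * block + decimal_count
    # Lower-case fields follow directly after the last upper-case value.
    return int(text, 36) + 16 * block + decimal_count


for field in ("99999", "A0000", "A0001"):
    print(field, hy36_decode(field))
```

With this scheme a five-column field never produces a string that is valid in two ranges at once, since the first character alone says which range applies. The five-digit hexadecimal form stops at int("fffff", 16), which is 1048575.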


> If there are guidelines in the PDB specification for when this field overflows
> I missed them, but it is a problem if there are rival hacks in common use
> (roll-over/wrap-around versus this semi-hex scheme).

There are no specs for how to handle more than 9999 residues, just like there are no specs for how to handle more than 99999 atoms.

Cheers,


				Andrew
				dalke at dalkescientific.com





