[Biopython] Reading large files, Biopython cookbook example
Andrew Dalke
dalke at dalkescientific.com
Mon Aug 5 23:09:05 UTC 2013
A bit late, but a bit of background:
> On Sun, Jul 14, 2013 at 5:40 PM, Katrina Lexa <klexa at umich.edu> wrote:
>> My PDB file came from Maestro, so that is the ordering it follows after 9999.
On Jul 15, 2013, at 7:46 PM, Peter Cock wrote:
> i.e. This software package? http://www.schrodinger.com/productpage/14/12/
>
> Could you contact their support to find out why they are doing this please?
Yes, that's the Maestro Katrina was almost certainly talking about. It's a commercial package which has been around for a while; the company started in 1990 as a commercialization of the Jaguar QM package from Richard Friesner's and William Goddard's labs at Caltech. Maestro is the GUI to their QM and MM codes.
Their conversion routines support various options. See:
https://www.schrodinger.com//AcrobatFile.php?type=supportdocs&type2=&ident=530
The key ones are:
-hex : Use hexadecimal encoding for atom numbers greater
than 99999 and for residue numbers greater than 9999
and
-hybrid36 : Use the hybrid36 scheme for atom serial numbers.
On input, integers of up to 6 digits and hexadecimal numbers are
recognized on ATOM records by default. On output, the default is
to use integers for less than 100 000 atoms, and hexadecimal for
100 000 atoms or more
Annoyingly, as Robert Hanson reported in:
http://www.mailinglistarchive.com/html/jmol-users@lists.sourceforge.net/2013-01/msg00111.html
(and see the thread at)
http://article.gmane.org/gmane.science.chemistry.blue-obelisk/1659/match=pdb+ok+who%27s+wise+guy
their default output generates records like:
ATOM  99998  H1  TIP3W3304     -28.543  60.673  40.064  1.00  0.00      WT5  H
ATOM  99999  H2  TIP3W3304     -27.773  60.376  41.353  1.00  0.00      WT5  H
ATOM  186a0  OH2 TIP3W3305     -24.713  61.533  47.372  1.00  0.00      WT5  O
ATOM  186a1  H1  TIP3W3305     -25.652  61.772  47.519  1.00  0.00      WT5  H
ATOM  186a2  H2  TIP3W3305     -24.713  61.625  46.379  1.00  0.00      WT5  H
which means the same file can contain two atoms with the serial number "18700" (or "99999", etc.): one where the field is read as decimal, and a later one where the same characters are hexadecimal.
This obviously messes up all of the other PDB annotations which use a serial id, but I presume that most Maestro users use PDB files only for coordinate data, and not for the other fields.
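To make the ambiguity concrete, here is a minimal Python sketch of one way a reader could decode such serial fields. It is not Maestro's or Biopython's code. It assumes the records appear in increasing serial order, so fields are decimal up to 99999 and hexadecimal afterwards; the function name decode_maestro_serials is made up for illustration.

```python
def decode_maestro_serials(fields):
    """Decode atom serial fields that switch from decimal to hex after 99999.

    Assumption: the fields come in file order with increasing serials.
    Without that ordering, a field such as "18700" cannot be decoded reliably.
    """
    serials = []
    in_hex = False
    previous = 0
    for field in fields:
        text = field.strip()
        if not in_hex:
            # Switch to hex on the first field that is not a plain decimal
            # number, or on a decimal-looking field that would run backwards
            # once 99999 has been reached.
            if not text.isdigit() or (previous >= 99999 and int(text) <= previous):
                in_hex = True
        value = int(text, 16) if in_hex else int(text)
        serials.append(value)
        previous = value
    return serials


# The same five characters give two different atoms:
print(int("18700"))      # decimal reading: 18700
print(int("18700", 16))  # hexadecimal reading: 100096
```

The decoder only works because it carries state from earlier records. A tool which looks at a single ATOM line in isolation has no way to tell the two readings apart.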
Maestro is the only program I know of which uses this awful form. Enabling the "-hybrid36" option (first-digit-is-in-base-36) by default would make it more consistent with tools in the X-PLOR/VMD heritage, where A0000 follows 99999. Presumably they want the full 1,048,575 atom range that five hexadecimal digits allow.
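For comparison, here is a small Python sketch of decoding a hybrid-36 style serial. It assumes the convention published by the cctbx project, in which a full-width field starting with an uppercase letter is a base-36 number offset so that A0000 means 100000. Only the uppercase range is handled; the lowercase extension is left out, and the function name decode_hybrid36_serial is hypothetical.

```python
import string

_UPPER_BASE36 = set(string.digits + string.ascii_uppercase)


def decode_hybrid36_serial(field, width=5):
    """Decode a serial field that is decimal or uppercase hybrid-36.

    Sketch only: assumes the cctbx hybrid-36 convention and ignores
    the lowercase range that follows ZZZZZ.
    """
    text = field.strip()
    if not text:
        raise ValueError("empty serial field")
    if text[0].isdigit() or text[0] == "-":
        # Ordinary decimal field, e.g. "99999".
        return int(text)
    if (len(text) == width and text[0] in string.ascii_uppercase
            and set(text) <= _UPPER_BASE36):
        # A0000 is 10 * 36**4 in pure base 36; shift it down to 10**5.
        return int(text, 36) - 10 * 36 ** (width - 1) + 10 ** width
    raise ValueError("not a decimal or uppercase hybrid-36 field: %r" % field)
```

Unlike the decimal-then-hex form, every field here decodes on its own: a leading digit always means decimal and a leading letter always means base 36, so no two atoms share the same text.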
> If there are guidelines in the PDB specification for when this field overflows
> I missed them, but it is a problem if there are rival hacks in common use
> (roll-over/wrap-around versus this semi-hex scheme).
There are no specs for how to handle more than 9999 residues, just like there are no specs for how to handle more than 99999 atoms.
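Since there is no spec, about the best a reader can do is notice when a file has gone past the standard fields. The following Python sketch flags ATOM/HETATM records whose atom serial (columns 7-11) or residue number (columns 23-26) is not a plain decimal integer. The function name find_nonstandard_fields is made up, and this is not how Biopython's parser works. Note that it cannot catch a hexadecimal value which happens to contain only the digits 0-9.

```python
def find_nonstandard_fields(lines):
    """Yield (line_number, field_name, raw_text) for suspicious fields.

    A field is reported when it cannot be parsed as a decimal integer,
    which covers hex, hybrid-36 and blank values.
    """
    for number, line in enumerate(lines, start=1):
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        # Fixed columns from the PDB format: serial 7-11, resSeq 23-26.
        for name, raw in (("serial", line[6:11]), ("resSeq", line[22:26])):
            try:
                int(raw)
            except ValueError:
                yield number, name, raw
```

A typical use would be to run this over open(filename) before parsing and warn the user, rather than silently guessing which overflow scheme the writing program used.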
Cheers,
Andrew
dalke at dalkescientific.com