[Biopython] Reading large files, Biopython cookbook example
p.j.a.cock at googlemail.com
Tue Aug 6 09:35:25 UTC 2013
On Tue, Aug 6, 2013 at 12:09 AM, Andrew Dalke <dalke at dalkescientific.com> wrote:
> A bit late, but a bit of background:
>> On Sun, Jul 14, 2013 at 5:40 PM, Katrina Lexa <klexa at umich.edu> wrote:
>>> My PDB file came from Maestro, so that is the ordering it follows after 9999.
> On Jul 15, 2013, at 7:46 PM, Peter Cock wrote:
>> i.e. This software package? http://www.schrodinger.com/productpage/14/12/
>> Could you contact their support to find out why they are doing this please?
> Yes, that's the Maestro Katrina was almost certainly talking about. It's a
> commercial package which has been around for a while; the company
> started in 1990 as a commercialization of the Jaguar QM package from
> Richard Friesner's and William Goddard's labs at CalTech. Maestro is
> the GUI to their QM and MM codes.
> Their conversion routines support various options. See:
> The key ones are:
> -hex : Use hexadecimal encoding for atom numbers greater
> than 99999 and for residue numbers greater than 9999
> -hybrid36 : Use the hybrid36 scheme for atom serial numbers.
> On input, integers of up to 6 digits and hexadecimal numbers are
> recognized on ATOM records by default. On output, the default is
> to use integers for less than 100 000 atoms, and hexadecimal for
> 100 000 atoms or more
> Annoyingly, as Robert Hanson reported in:
> (and see the thread at)
> their default output generates records like:
> ATOM 99998 H1 TIP3W3304 -28.543 60.673 40.064 1.00 0.00 WT5 H
> ATOM 99999 H2 TIP3W3304 -27.773 60.376 41.353 1.00 0.00 WT5 H
> ATOM 186a0 OH2 TIP3W3305 -24.713 61.533 47.372 1.00 0.00 WT5 O
> ATOM 186a1 H1 TIP3W3305 -25.652 61.772 47.519 1.00 0.00 WT5 H
> ATOM 186a2 H2 TIP3W3305 -24.713 61.625 46.379 1.00 0.00 WT5 H
> which means there can be two atoms with serial number "18700" (or
> "99999", etc.) in the same file: one read as decimal and the other as
> hexadecimal, each referring to a different atom.
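To make the ambiguity concrete, here is a minimal Python sketch of how a reader might decode such serial fields. It is not Biopython code and not Maestro's own logic; the function name decode_serials is hypothetical, and it assumes the serials appear in file order, decimal up to 99999 and hexadecimal from 186a0 onwards.

```python
HEX_LETTERS = set("abcdefABCDEF")


def decode_serials(fields):
    """Decode ATOM serial fields given in file order.

    Assumption (not from any specification): the writer uses decimal up
    to 99999 and switches to hexadecimal at 186a0 (100000), so the first
    field containing a hex letter marks the switch.
    """
    serials = []
    hex_mode = False
    for field in fields:
        text = field.strip()
        # Once a hex letter has been seen, every later field is hex,
        # even one made up of digits only (e.g. "18700").
        if not hex_mode and any(c in HEX_LETTERS for c in text):
            hex_mode = True
        serials.append(int(text, 16) if hex_mode else int(text))
    return serials


# The same text "18700" decodes differently depending on where it appears.
print(decode_serials(["18700", "99999", "186a0", "18700"]))
```

The field alone cannot tell you which base was meant, only its position in the file can, so anything that refers to an atom by serial without that context stays ambiguous.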
> This obviously messes up all of the other PDB annotations which use
> a serial id, but I presume that most Maestro users only use PDB files
> for coordinate data, and not for the other fields.
> Maestro is the only program I know of which uses this awful form.
> Enabling the "-hybrid36" option (first-digit-is-in-base-36) by default
> would make it more consistent with tools in the X-PLOR/VMD
> heritage, where A0000 follows 99999. Presumably they want
> the full 1,048,575 atom range.
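For comparison, a rough Python sketch of decoding a hybrid-36 style serial, where A0000 follows 99999. The function name decode_hybrid36_serial is hypothetical, and the offsets follow the hybrid-36 scheme as I understand it (upper-case base 36 after the decimal range, then lower-case), so treat it as an illustration rather than a reference implementation.

```python
def decode_hybrid36_serial(field, width=5):
    """Decode a hybrid-36 style serial field (sketch, width 5 for atoms).

    Plain decimal is used up to 10**width - 1; after that an upper-case
    base-36 range starts at "A0000", followed by a lower-case range.
    """
    text = field.strip()
    if not text:
        raise ValueError("empty serial field")
    first = text[0]
    if first.isdigit() or first == "-":
        return int(text)
    # int(text, 36) is case-insensitive, so the case of the first
    # character decides which range the value belongs to.
    if first.isupper():
        return int(text, 36) - 10 * 36 ** (width - 1) + 10 ** width
    return int(text, 36) + 16 * 36 ** (width - 1) + 10 ** width


print(decode_hybrid36_serial("99999"))  # last plain decimal serial
print(decode_hybrid36_serial("A0000"))  # first base-36 serial
print(int("fffff", 16))                 # top of the five-digit hex range
```

Unlike the decimal-then-hex default, here a field starting with a digit is always decimal, so each five-character string has only one meaning.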
>> If there are guidelines in the PDB specification for when this field overflows
>> I missed them, but it is a problem if there are rival hacks in common use
>> (roll-over/wrap-around versus this semi-hex scheme).
> There are no specs for how to handle more than 9999 residues,
> just like there are no specs for how to handle more than 99999 atoms.
> dalke at dalkescientific.com
Thanks Andrew - useful background.
In the long run this problem should go away as the PDB moves
to using the PDBx/mmCIF format: