[Bioperl-l] PDB ATOM records: name, segid, etc.

Andrew Dalke dalke@dalkescientific.com
Mon, 15 Jul 2002 09:03:54 -0600


Kris Boulez:
> Forgive me my ignorance (did NMR at university and left for the IT world
> more than ten years ago): but what are these CNS files? Is this a
> PDB-derived structure format?

CNS (Crystallography & NMR System) is Alex Brunger's replacement for XPLOR.
I understand there are some licensing issues behind it, but I don't know
the details.

> Does someone know where I could find descriptions of 'older' PDB
> formats. The current parser is written based on a document titled
> 'Protein Data Bank Contents Guide: version 2.1 (october 25, 1996)' .
> 
> If so I would certainly add other versions.

I have an old project called "UPDB" which is a parser generator for
PDB files -- given a format description it produces a parser in Python,
Perl, Tcl, and maybe a couple of other languages.  It never went anywhere
(yet).

It includes a 1.x and 2.x format description, as well as commentary
about the XPLOR differences.

ftp://ftp.ks.uiuc.edu/pub/group/dalke/UPDB-0.5.tar.gz


> At the moment the Atom object is purely a container for info in the PDB
> file. It only knows its id ('CZ2'), it does not know that it is a
> carbon.
> 
> Problem with these spaces is that people want to be able to say
> 
>   if ( $atom->id eq "CZ2" ) {
> 
> without bothering about the spaces and/or rearrangement of the name in
> the PDB files
> 
> A ->display_id() method, which would give the name as it was in the PDB
> file, might help

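You need that to distinguish between " CA " and "CA  " (alpha carbon and
calcium).  In the PDB atom name field the element symbol is right-justified
in the first two columns, so the position of the spaces carries information.
Better still is to put the proper logic in the parser, because everyone else
will be confused as well.  And still make the original string accessible in
case the parser gets it wrong.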
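
As a rough illustration of that logic, here is a minimal sketch in Python.
The names AtomName and parse_atom_name are made up for this example and are
not part of Bioperl or UPDB.  It assumes the usual PDB convention that the
element symbol is right-justified in the first two columns of the 4-column
atom name field, and it keeps the original string so a wrong guess can be
detected later.

```python
from typing import NamedTuple


class AtomName(NamedTuple):
    raw: str      # the 4-character field exactly as it appeared in the file
    id: str       # whitespace-stripped name, convenient for comparisons
    element: str  # best-guess element symbol (heuristic, may be wrong)


def parse_atom_name(field: str) -> AtomName:
    """Split a PDB atom name field (columns 13-16) into name and element.

    Heuristic only: four-character hydrogen names such as "HD11" start in
    the first column and will be misread as a two-letter element.
    """
    if len(field) != 4:
        raise ValueError("atom name field must be exactly 4 characters")
    first, second = field[0], field[1]
    if first == " " or first.isdigit():
        # One-letter element in the second column, e.g. " CA " or "1HD1".
        element = second.upper()
    else:
        # Two-letter element filling both columns, e.g. "CA  " or "FE  ".
        element = field[:2].strip().upper()
    return AtomName(raw=field, id=field.strip(), element=element)


if __name__ == "__main__":
    for field in (" CA ", "CA  ", " CZ2"):
        print(repr(field), parse_atom_name(field))
```

Both " CA " and "CA  " end up with the id "CA", so a comparison against
"CA" still works, while the element guess and the raw string tell the
alpha carbon and the calcium apart.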


> Which other formats would people be interested in?

In large molecule chemistry, no one I know of has switched to mmCIF.
Nor do I know folk who use CML or MMDB (NCBI's ASN.1 structure format).

In small molecule chemistry there are a few other important formats:
  MDL's 'CT' or 'molfile' formats, which include SD, RXN, and a few other flavors
    http://www.mdli.com/downloads/ctfile/ctfile_subs.html
    (Format documentation as of a couple years ago had a few errors and
     missing sections.  I have a Martel grammar if you want to look at it.)

  Tripos's 'mol2' file format
    http://www.tripos.com/custResources/mol2Files/
    (This one is about 1/10th as important as molfile support.)

  SMILES -- not really a file format per se but a line notation (a way to
    represent the compound's 2D topology as a short string)
    http://www.daylight.com/smiles/
    (I contributed a readable tokenizer for SMILES as part of Brian Kelley's
     FROWNS project.)

The problem with these is that they require good support for:
   - bond types/orders
   - aromaticity
   - chirality

and if you are only used to dealing with PDB files you likely
won't know how these should be handled.  Plus, the different formats
handle them differently, in a somewhat incompatible fashion.
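
To make those three items concrete, here is a small sketch of a tokenizer
for a subset of SMILES, written in Python.  It is not the FROWNS tokenizer,
and the token names are made up for this example.  It only splits the string
into pieces; it does not validate or interpret them.  Bond order shows up as
explicit bond symbols, aromaticity as lowercase atom symbols, and chirality
as '@' markers inside bracket atoms.

```python
import re

# Token classes for a subset of SMILES (assumed names, illustration only).
TOKEN_RE = re.compile(
    r"""
    (?P<bracket_atom>\[[^\]]+\])    # e.g. [C@@H], [NH4+]; chirality lives here
  | (?P<atom>Cl|Br|[BCNOPSFI])      # aliphatic atoms of the organic subset
  | (?P<aromatic_atom>[bcnops])     # lowercase means aromatic
  | (?P<bond>[-=\#:/\\])            # single, double, triple, aromatic, cis/trans
  | (?P<ring_closure>%\d\d|\d)      # ring bond labels
  | (?P<branch>[()])                # branch open/close
  | (?P<dot>\.)                     # disconnected components
    """,
    re.VERBOSE,
)


def tokenize(smiles):
    """Return a list of (token_type, text) pairs, or raise ValueError."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {smiles[pos]!r} at position {pos}"
            )
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    # Kekule and aromatic spellings of benzene, then L-alanine with chirality.
    for smiles in ("C1=CC=CC=C1", "c1ccccc1", "N[C@@H](C)C(=O)O"):
        print(smiles, tokenize(smiles))
```

The two benzene strings describe the same molecule but produce different
tokens, which is the kind of difference a format converter has to
reconcile before it can compare structures.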

Another place to look for details on these formats is Babel/OpenBabel
  http://openbabel.sourceforge.net/

					Andrew
					dalke@dalkescientific.com