[Bioperl-l] PDB ATOM records: name, segid, etc.

Kris Boulez kris.boulez@algonomics.com
Mon, 15 Jul 2002 10:20:44 +0200


[ It's nice to see that people are using these modules and have comments
on them. I'm aware that PDB writing isn't perfect at the moment. ]

I'll give a lightning talk about Bio::Structure at BOSC 02. We might
also discuss this in more depth there.


Quoting Andrew Dalke (adalke@mindspring.com):
> Joe Krahn:
> > Although SEGID is deprecated by the official PDB standard,
> > it is useful to me because I work with CNS files. Would people
> > be opposed to supporting it in BioPerl? (Note- most crystallographers
> > want to keep the SEGID. It is a useful thing, especially now
> > that PDB disallows CHAINID for ligands.)

Forgive my ignorance (I did NMR at university and left for the IT world
more than ten years ago), but what are these CNS files? Is this a
PDB-derived structure format?

> 
> I say keep it.  You'll need it for old PDB files as well.  'Course,
> you'll also need the logic to distinguish between the different
> versions.
> 
Does anyone know where I could find descriptions of 'older' PDB
formats? The current parser is based on a document titled
'Protein Data Bank Contents Guide: version 2.1 (October 25, 1996)'.

If so, I would certainly add support for other versions.

> > Another useful but non-standard optional feature is a 4th residue
> > character. It can be useful for designating variants of a residue,
> > like HISD for HIS protonated at ND.
> 
> Also very important.  Great for things like "TIP3" waters.  (BTW,
> I have an XPLOR background.  :)
> 
 (see below)
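
For reference, here is a rough sketch of where the fields discussed so far
sit in an ATOM record. It is written in Python rather than Perl purely for
illustration and is not the Bio::Structure code. The column numbers follow
the 2.x Contents Guide layout; the helper name parse_atom_fields, the
ATOM_FIELDS table and the sample record are all made up for this example,
and reading the residue name from columns 18-21 is the non-standard 4th
character extension mentioned above.

```python
# Column ranges are 1-based and inclusive, as in the PDB Contents Guide 2.x.
# "resname" deliberately spans 18-21: the official field is 18-20, and
# column 21 holds the non-standard 4th residue character (HISD, TIP3, ...).
ATOM_FIELDS = {
    "name": (13, 16),
    "resname": (18, 21),
    "chain": (22, 22),
    "segid": (73, 76),
    "element": (77, 78),
}


def parse_atom_fields(line):
    """Slice selected fields out of one ATOM/HETATM line (hypothetical helper)."""
    # Pad short lines so missing trailing fields come back as blanks.
    padded = line.rstrip("\n").ljust(80)
    return {key: padded[start - 1:end] for key, (start, end) in ATOM_FIELDS.items()}


# Invented sample record, assembled field by field so the columns are
# correct by construction.
sample = (
    "ATOM  " + "    1" + " " + " OH2" + " " + "TIP3" + "W" + "   1" + " " + "   "
    + "  10.000" + "  20.000" + "  30.000" + "  1.00" + "  0.00"
    + "      " + "WAT1" + " O"
)
fields = parse_atom_fields(sample)
print(fields["resname"], fields["segid"], repr(fields["element"]))
```

Fields are returned unstripped on purpose, so the caller decides whether
the leading space in " O" or " OH2" matters.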

> > The first two letters are always the element. The alignment seems
> > strange until you realize this. Think of carbon being represented
> > by " C". A leading non-letter character is allowed for atom names
> > that are too long, mostly hydrogens. The current pdb.pm shifts
> > <number>H correctly (a good guess) but will get all 2-letter elements
> > wrong. "CA  " for calcium will become " CA ", a carbon atom.
> 
> Um.  The letters of the first two characters are always the element.
> Unless they aren't.  I've seen "U" used for "Unknown".  And then there's
> dealing with all the programs which have a different interpretation of that
> field.  Roger Sayle gave a talk about this early last year.
> 
> http://www.daylight.com/meetings/mug01/Sayle/m4xbondage.html
> ] 2c. All atom records containing " Q" as the first two characters of the
> ] atom name were ignored. The element " Q" is commonly used in NMR processing
> ] to represent pseudo atoms used in refinement
> 
> > So, if pdb.pm is going to remove the leading space on
> > atom names (technically wrong, but probably desirable for many people)
> 
> That isn't technically wrong.  It depends on the context.  " C" is the
> representation of "carbon" in the PDB file.  Internally, bioperl could store
> it as "carbon" or "12" or "mixelja" so long as it is consistent and
> captures the data model correctly.
> 
At the moment the Atom object is purely a container for the info in the PDB
file. It only knows its id ('CZ2'); it does not know that it is a
carbon.

The problem with these spaces is that people want to be able to say

  if ( $atom->id eq "CZ2" ) {

without bothering about the spaces and/or the rearrangement of the name in
the PDB file.

A ->display_id() method, which would return the name exactly as it appeared
in the PDB file, might help.
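
A minimal sketch of that idea, in Python rather than Perl and only as an
illustration: the class below is hypothetical and is not the
Bio::Structure::Atom interface. It assumes the object keeps the four-column
name exactly as read, hands out a stripped id for comparisons, and returns
the untouched name from display_id().

```python
class Atom:
    """Hypothetical container that keeps the 4-column atom name as read."""

    def __init__(self, raw_name):
        # raw_name is columns 13-16 of the ATOM record, spaces included.
        self.raw_name = raw_name

    @property
    def id(self):
        # Convenient form for comparisons such as atom.id == "CZ2".
        return self.raw_name.strip()

    def display_id(self):
        # Name exactly as it appeared in the file, suitable for writing back.
        return self.raw_name


alpha_carbon = Atom(" CA ")
calcium = Atom("CA  ")
print(alpha_carbon.id == calcium.id)                      # True
print(alpha_carbon.display_id() == calcium.display_id())  # False
```

The two atoms compare equal by id but not by display_id(), which is the
C-alpha versus calcium ambiguity raised earlier in the thread: the stripped
id alone does not carry enough information to write the record back.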

> > then reading an ATOM needs to generate the element entry when an ATOM
> > doesn't include it. This can also be a problem - a PDB file with
> > no element entries and improper atom alignment will generate bad
> > element entries, but at least it works for all single-letter elements.
> 
> That's a hard task.  See section 3 in Roger's paper.
> 
> A question I have is:  Do you want a faithful representation of the data in the
> PDB (in which case a missing element field is left missing) or do you want
> a translation to another model of chemistry (as Roger does for SMILES)?
> 
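
To make the two options concrete, here is a rough Python sketch (again not
Perl and not the current parser). The function element_for and its faithful
flag are invented for this example. In faithful mode a missing element
field stays missing; otherwise the element is guessed from the letters in
the first two columns of the atom name, which is only right when the name
is aligned according to the standard.

```python
def element_for(raw_name, element_field="", faithful=True):
    """Return an element symbol, or None if it is unknown (hypothetical helper).

    raw_name is the unstripped 4-column atom name; element_field is
    columns 77-78 of the record, possibly blank.
    """
    explicit = element_field.strip()
    if explicit:
        return explicit  # the file says so; trust it
    if faithful:
        return None  # keep the gap that was in the file
    # Guess from columns 13-14: " C" -> C, "CA" -> Ca, "1H" -> H.
    # Misaligned names (a C-alpha written as "CA  ") and 4-character
    # hydrogen names starting in column 13 will be guessed wrongly,
    # and " Q" pseudo atoms come back as "Q", which is not an element.
    letters = "".join(ch for ch in raw_name.ljust(4)[:2] if ch.isalpha())
    return letters.capitalize() or None


print(element_for(" CA ", faithful=False))  # C
print(element_for("CA  ", faithful=False))  # Ca
print(element_for(" CA "))                  # None
```

Keeping the guess behind a flag means a reader of the data can tell an
element that came from the file from one that was inferred.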

Which other formats would people be interested in?

Kris,
-- 
Kris Boulez 				Tel: +32-9-241.11.00
AlgoNomics NV 				Fax: +32-9-241.11.02
Technologiepark 4 			email: kris.boulez@algonomics.com
B 9052 Zwijnaarde 			http://www.algonomics.com/