[Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex

Sun Nov 1 14:50:31 EST 2009

Hi,

I'm a computer science guy trying to figure out some chemistry logic to support my thesis, so bear with me! :-) To sum it up, I'm not sure MMCIFParser is handling ATOM and MODEL records correctly because of this code in MMCIFParser:
            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "
This causes ATOM (and potentially MODEL) records to die as seen in the exception below (I think!)

My questions are:
1. Am I correct the correct code is insufficient?
2. What additional logic beyond just recognizing whether it's a HETATM, ATOM or MODEL record needs to be added?

Thanks!

Paul

Background:
I understand MMCIFlex.py et cetera is commented out in the Windows setup.py package due to difficulties compiling it. So I re-wrote MMCIFlex strictly in Python to emulate what I THINK the original MMCIFlex did. My version processes a .cif file one line at a time (readline()) then passes tokens back to MMCIF2Dict at  each call to get_token(). That seems to work fine for unit testing of my MMCIFlex and MMCIFDict which I had to slightly re-write (to ensure it handled how I passed SEMICOLONS line back etc).

However when I try and use this with MMCIFParser against the 2beg .cif file which has no HETATM records and, as I understand the definition, no disordered atoms I get:

C:\Python25\Lib\site-packages\Bio\PDB\StructureBuilder.py:85: PDBConstructionWar
ning: WARNING: Chain A is discontinuous at line 0.
  PDBConstructionWarning)
C:\Python25\Lib\site-packages\Bio\PDB\StructureBuilder.py:122: PDBConstructionWa
rning: WARNING: Residue (' ', 17, ' ') redefined at line 0.
  PDBConstructionWarning)
Traceback (most recent call last):
  File "MMCIFParser.py", line 140, in <module>
    structure=p.get_structure("test", filename)
  File "MMCIFParser.py", line 23, in get_structure
    self._build_structure(structure_id)
  File "MMCIFParser.py", line 88, in _build_structure
    icode)
  File "C:\Python25\lib\site-packages\Bio\PDB\StructureBuilder.py", line 148, in
 init_residue
    % (resname, field, resseq, icode))
PDBExceptions.PDBConstructionException: Blank altlocs in duplicate residue LEU (
' ', 17, ' ')

Basically what I think MIGHT be happening is MMCIFParser is currently only handling HETATM records, when some other kind of record comes in (ATOM, MODEL) it is treated incorrectly. See below.

MMCIFParser.py
    def _build_structure(self, structure_id):
 ....
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
 ....
        for i in xrange(0, len(atom_id_list)):
     ...
            altloc=alt_list[i]
            if altloc==".":
                altloc=" "
     ...
            fieldname=fieldname_list[i]
            #How are ATOM and MODEL records handled?
            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "
            if current_chain_id!=chainid:
                current_chain_id=chainid
                structure_builder.init_chain(current_chain_id)
                current_residue_id=resseq
                icode, int_resseq=self._get_icode(resseq)
                #This is line 87-88 in the real file
                structure_builder.init_residue(resname, hetatm_flag, int_resseq, 
                    icode)

Class StructureBuilder:
    ...
    def init_residue(self, resname, field, resseq, icode):
        if field!=" ":
            if field=="H":
                # The hetero field consists of H_ + the residue name (e.g. H_FUC)
                field="H_"+resname 
        res_id=(field, resseq, icode) 
       ...
        #This line will get executed for any non-HETATM record (ie ATOM Or MODEL)
        #because in MMCIFParser, if it wasn't a HETATM, then it's a ' ' ???
        if field==" ":
            if self.chain.has_id(res_id):
=======>But there are no point mutations in 2beg that I know of. Shouldn't be here!
                # There already is a residue with the id (field, resseq, icode).
                # This only makes sense in the case of a point mutation.
                if __debug__:
                    warnings.warn("WARNING: Residue ('%s', %i, '%s') "
                                  "redefined at line %i."
                                  % (field, resseq, icode, self.line_counter),
                                  PDBConstructionWarning)
                duplicate_residue=self.chain[res_id]
                if duplicate_residue.is_disordered()==2:
                    # The residue in the chain is a DisorderedResidue object.
                    # So just add the last Residue object. 
                    if duplicate_residue.disordered_has_id(resname):
                        # The residue was already made
                        self.residue=duplicate_residue
                        duplicate_residue.disordered_select(resname)
                    else:
                        # Make a new residue and add it to the already
                        # present DisorderedResidue
                        new_residue=Residue(res_id, resname, self.segid)
                        duplicate_residue.disordered_add(new_residue)
                        self.residue=duplicate_residue
                        return
                else:
                    # Make a new DisorderedResidue object and put all
                    # the Residue objects with the id (field, resseq, icode) in it.
                    # These residues each should have non-blank altlocs for all their atoms.
                    # If not, the PDB file probably contains an error. 
====>          #This is the line throwing the exception, but we shouldn't be here!
                    if not self._is_completely_disordered(duplicate_residue):
                        # if this exception is ignored, a residue will be missing
                        self.residue=None
                        raise PDBConstructionException(\
                            "Blank altlocs in duplicate residue %s ('%s', %i, '%s')" \
                            % (resname, field, resseq, icode))
                    self.chain.detach_child(res_id)
                    new_residue=Residue(res_id, resname, self.segid)
                    disordered_residue=DisorderedResidue(res_id)
                    self.chain.add(disordered_residue)
                    disordered_residue.disordered_add(duplicate_residue)
                    disordered_residue.disordered_add(new_residue)
                    self.residue=disordered_residue
                    return
        residue=Residue(res_id, resname, self.segid)
        self.chain.add(residue)
        self.residue=residue