[Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex

Tue Nov 3 16:36:08 UTC 2009

Hi,

I have found the reason why MMCIParser is dying. It has no provision for more than one model, so when a second model comes around with the same chain and residue the program throws an exception.

I will be joining github to submit the required changes. I haven't used github before, and this is my first open source project so please give me a few days to acclimate.

My mods so far are as follows in MMCIFParser.py (and require the MMCIFlex.py and MMCIF2Dict.py files I will be submitting via github, and have submitted to Peter privately.)

Change the __doc__ setting:
#Mod by Paul T. Bathen to reflect MMCIFlex built solely in Python
__doc__="mmCIF parser (implemented solely in Python, no lex/flex/C code needed)" 

Insert the following model_list line:
        occupancy_list=mmcif_dict["_atom_site.occupancy"]
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
        #Added by Paul T. Bathen Nov 2009
        model_list=mmcif_dict["_atom_site.pdbx_PDB_model_num"]
        try:

Make the following changes:
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #current_model_id=0
        structure_builder=self._structure_builder
        structure_builder.init_structure(structure_id)
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #structure_builder.init_model(current_model_id)
        structure_builder.init_seg(" ")
        #Added by Paul T. Bathen Nov 2009
        current_model_id = -1

Make the following changes in the for loop:
            #Note by Paul T. Bathen: should this include the HOH and WAT stmts in PDBParser?
            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "

            #Added by Paul T. Bathen Nov 2009
            model_id = model_list[i]
            if current_model_id != model_id:
                current_model_id = model_id
                structure_builder.init_model(current_model_id)
            #end of addition

After these changes took place, and with the new MMCIFlex and MMCIF2Dict in place, I was able to parse and test 2beg.cif and pdb2bec.ent and both parsed with the same number of models, chains, and residues. 

The only difference is the PDBParser incorrectly states the first model as 0 when it should be 1: there is an explicit MODEL line in pdb2beg.ent. So all the models are off by one in 2beg when parsed by PDBParser.py. I can look into the bug in PDBParser.py and submit it if desired?

Paul

--- On Mon, 11/2/09, Paul B <tallpaulinjax at yahoo.com> wrote:

From: Paul B <tallpaulinjax at yahoo.com>
Subject: Re: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
To: biopython-dev at biopython.org
Date: Monday, November 2, 2009, 8:21 AM

I'll use the conventional response technique in future emails! :-)

Hi Peter,

1. "Did you mean to not CC the list?": Sorry, I replied to your email 
address instead of the CC: address! 
2. Peter: "I should be able to run the flex code and you new code side by side,
for testing and profiling. Note sure when I'll find the time exactly, but
we'll see. Examples will help as while I know plenty about PDB files,
I've not used CIF at all": I'd be glad to run the tests myself as well 
and I have the time! :-) But without the flex module installed and 
operational the only way I can think of is with pickle'd .cif dicts.
3. Peter: "P.S. Are you OK with making this contribution under the Biopython
license?" Absolutely I'd be glad to contribute to biopython!

This was in response to my followup email to Peter:
"Hi Peter:

Paul: So I re-wrote MMCIFlex strictly in Python to emulate (the lex based MMCIFlex)

Peter: Now that would be very handy (IMO), if you can get it working.
Have you benchmarked it against the flex code? Have you been able 
to test the flex code? If not, could you give me a tiny script using the 
2beg cif file which should work? If that works, then the problem is in 
your flex replacement code.

Paul: It already works, but I have no way to benchmark it against the
flex code myself. Perhaps someone could pickle a half dozen PDB .cif files and 
send me the resultant files? I can then run a test agains each one. 
I'll also clean up the code on both the new MMCIFlex.py as well as the 
changed MMCIF2Dict.py and send them to you most probably by today. 
Each will have a __main__ method for testing."

--- On Sun, 11/1/09, Peter <biopython at maubp.freeserve.co.uk> wrote:

From: Peter <biopython at maubp.freeserve.co.uk>
Subject: Re: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
To: "Paul B" <tallpaulinjax at yahoo.com>
Cc: biopython-dev at biopython.org
Date: Sunday, November 1, 2009, 4:28 PM

On Sun, Nov 1, 2009 at 7:50 PM, Paul B <tallpaulinjax at yahoo.com> wrote:
>
> Hi,
>
> I'm a computer science guy trying to figure out some chemistry logic
> to support my thesis, so bear with me! :-) To sum it up, I'm not sure
> MMCIFParser is handling ATOM and MODEL records correctly
> because of this code in MMCIFParser:
>             if fieldname=="HETATM":
>                 hetatm_flag="H"
>             else:
>                 hetatm_flag=" "
> This causes ATOM (and potentially MODEL) records to die as seen
> in the exception below (I think!)

I'll answer that below.

> My questions are:
> 1. Am I correct the correct code is insufficient?
> 2. What additional logic beyond just recognizing whether it's a
> HETATM, ATOM or MODEL record needs to be added?
>
> Thanks!
>
> Paul
>
>
> Background:
> I understand MMCIFlex.py et cetera is commented out in the
> Windows setup.py package due to difficulties compiling it.

It is commented out (on all platforms) because we don't know
how to get setup.py to detect if flex and the relevant headers
are installed, which we would need to compile the code. I'm
note sure how this would work on Windows with an installer
(i.e. what is a run time dependency versus compile time).

> So I re-wrote MMCIFlex strictly in Python to emulate what

Now that would be very handy (IMO), if you can get it working.
Have you benchmarked it against the flex code?

> I THINK the original MMCIFlex did. My version processes
> a .cif file one line at a time (readline()) then passes tokens
> back to MMCIF2Dict at  each call to get_token(). That
> seems to work fine for unit testing of my MMCIFlex and
> MMCIFDict which I had to slightly re-write (to ensure it
> handled how I passed SEMICOLONS line back etc).
>
> However when I try and use this with MMCIFParser
> against the 2beg .cif file which has no HETATM records
> and, as I understand the definition, no disordered atoms
> I get:
>
> ...
>
> Basically what I think MIGHT be happening is MMCIFParser
> is currently only handling HETATM records, when some other
> kind of record comes in (ATOM, MODEL) it is treated
> incorrectly. See below.
>
> ...

Have you been able to test the flex code? If not, could you
give me a tiny script using the 2beg cif file which should
work? If that works, then the problem is in your flex
replacement code.

Peter

_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev