[Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py

redmine at redmine.open-bio.org redmine at redmine.open-bio.org
Sun Mar 4 20:21:13 UTC 2012

Issue #2619 has been updated by Eric Talevich.

Lenna Peterson wrote:
> Is the desire to use a C parser due to performance concerns (30k+ line files)?

Presumably. Our PDB parser is pure-Python, and the original author has noted dissatisfaction with its speed. The trade-off is portability, and with PyPy getting faster and more widely usable, the argument for pure Python probably wins now.

> 1) There is at least one python implementation of lex: http://www.dabeaz.com/ply/ (BSD license)

PLY is written entirely in Python, and appears to be supported on all the Python versions we support. I haven't used it, but it looks like a good option.

Not sure if we would need to add PLY as a dependency, or if it generates Python files we could check in to Git and distribute directly.

> 2) The mmCIF parser could possibly be written in core python.

This would probably not be difficult. I'm not sure what to expect in terms of performance between flex, PLY, and manual Python "if" statements and string methods. The mmCIF format looks quite machine-friendly, and I think regular expressions could be mostly avoided. 

Lenna, if you have some time and interest to look into this, the files to modify or replace are:

The options are:

(a) Write (or use PLY to generate) a pure-Python version of the module Bio.PDB.mmCIF.MMCIFlex. This is currently compiled as a C extension, but a Python version of it could be imported as a backup if the C version isn't available.

(b) Modify MMCIF2Dict directly, and implement the state machine there. I suppose you'd have a separate function/method that reads one line at a time from the file, checks the current state and the contents of the line (e.g. line.startswith('#')), updates the state if needed, and emits tokens.

Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py

Author: Chris Oldfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.48

MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py.  According to  


this is because it doesn't compile on Windows.  Though the function is documented, the changes needed to enable it are not, so this seems like an installation bug to me.

The fix on Linux is to uncomment setup.py from line 486 onward.  A general workaround might be to condition the compile on the sys.platform variable. I'd offer a diff, but I'm new to Biopython and Python in general, so please forgive my ignorance.
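
A rough sketch of that workaround, for illustration only: the helper name optional_extensions is hypothetical, the actual layout of setup.py is not reproduced here, and it assumes the flex-based lexer is the only extension that needs to be skipped on Windows.

```python
import sys


def optional_extensions(platform=None):
    """Return the names of optional C extensions to build on this platform."""
    platform = sys.platform if platform is None else platform
    extensions = []
    # Assumption: MMCIFlex builds everywhere except Windows ("win32").
    if not platform.startswith("win"):
        extensions.append("Bio.PDB.mmCIF.MMCIFlex")
    return extensions
```

setup.py could then build only the extensions returned for the current platform, so Linux users would get MMCIFlex by default without breaking the Windows build.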

Source install of version 1.48, Gentoo Linux 2008, x86_64.
