[Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock
redmine at redmine.open-bio.org
Tue Aug 21 07:21:57 EDT 2012
Issue #3379 has been updated by Peter Cock.
Given Joao's comments, lenience does not sound appropriate in this case.
If the parser's current behaviour is to silently ignore data after an END line, that seems less than ideal.
How about we add a clear error or warning to the parser if there is content in the file after an END line? That is, treat it as an exception in strict mode, and as a warning in permissive mode (while continuing to ignore anything after the END line).
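A minimal sketch of that check, in plain Python so it runs without Biopython. The function name @check_after_end@ and the two classes are hypothetical stand-ins (Bio.PDB has its own @PDBConstructionWarning@ and @PDBConstructionException@); this is not the parser's actual code, only an illustration of the strict versus permissive behaviour described above.

```python
import warnings


class TrailingDataWarning(UserWarning):
    """Hypothetical stand-in for Bio.PDB's PDBConstructionWarning."""


class TrailingDataError(Exception):
    """Hypothetical stand-in for Bio.PDB's PDBConstructionException."""


def check_after_end(lines, permissive=True):
    """Return the lines up to and including the first END record.

    If any non-blank line follows END, warn (permissive mode)
    or raise (strict mode). Lines after END are never returned.
    """
    for i, line in enumerate(lines):
        # Record names occupy the first six columns; ENDMDL is not END.
        if line[:6].strip() == "END":
            trailing = [l for l in lines[i + 1:] if l.strip()]
            if trailing:
                msg = "%d line(s) found after END record" % len(trailing)
                if not permissive:
                    raise TrailingDataError(msg)
                warnings.warn(msg + "; ignoring them", TrailingDataWarning)
            return lines[:i + 1]
    # No END record at all: nothing to check.
    return lines
```

In permissive mode the caller still gets only the data before END, but is told that something was dropped; in strict mode the file is rejected outright.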
A sample file would be helpful to verify this, and could even be used for a unit test (with your permission).
----------------------------------------
Bug #3379: PDBParser fails to parse PDBs produced by PatchDock
https://redmine.open-bio.org/issues/3379
Author: David Cain
Status: New
Priority: Low
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.57
URL:
I apologize in advance if this technically doesn't count as a bug, as the problem arises from improperly formatted PDB files.
h3. Background
Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file.
h3. Why PDBParser fails
Utilities like ZDOCK strip a lot of data from the input files, creating a poorly formed PDB file that raises PDBConstructionWarnings but that PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were: the only thing that changes is the ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and to cease parsing when one is encountered. This means that many complexes parse cleanly but completely exclude the ligand.
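To see how much of a given complex would be dropped under this behavior, something like the following could be run on the file's lines. It is only a diagnostic sketch in plain Python; the function name @atoms_before_and_after_trailer@ is made up for illustration and nothing here is part of Bio.PDB.

```python
def atoms_before_and_after_trailer(lines):
    """Count coordinate records before and after the first END/CONECT.

    Returns a (before, after) tuple. A non-zero 'after' count means a
    parser that stops at the first END or CONECT would miss those atoms.
    """
    before = 0
    after = 0
    in_trailer = False
    for line in lines:
        record = line[:6].strip()  # record name is in columns 1-6
        if record in ("END", "CONECT"):
            in_trailer = True
        elif record in ("ATOM", "HETATM"):
            if in_trailer:
                after += 1
            else:
                before += 1
    return before, after
```

For a receptor and ligand that were simply concatenated, the second number would be the ligand's atom count.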
h3. How to fix the problem
Now, in an ideal world, the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files?
h3. Potential change to @PDBParser._parse_coordinates@?
If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is issued, but they're added to the structure.
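A rough sketch of the idea, working on raw lines rather than inside @_parse_coordinates@. It uses a loop instead of the recursive call mentioned above, and the names @coordinate_sections@ and @ExtraCoordinatesWarning@ are hypothetical; the real change would issue a PDBConstructionWarning and feed the extra records to the structure builder.

```python
import warnings


class ExtraCoordinatesWarning(UserWarning):
    """Hypothetical stand-in for Bio.PDB's PDBConstructionWarning."""


def coordinate_sections(lines):
    """Split lines into runs of ATOM/HETATM records.

    A run ends at an END or CONECT record. Coordinate records found
    after such a record start a new section instead of being dropped,
    and a warning is issued if more than one section is found.
    """
    sections = []
    current = []
    for line in lines:
        record = line[:6].strip()
        if record in ("END", "CONECT"):
            if current:
                sections.append(current)
                current = []
        elif record in ("ATOM", "HETATM"):
            current.append(line)
        # All other record types are ignored in this sketch.
    if current:
        sections.append(current)
    if len(sections) > 1:
        warnings.warn(
            "coordinate records found after END/CONECT; keeping them",
            ExtraCoordinatesWarning,
        )
    return sections
```

For a PatchDock-style complex this would return two sections, receptor first and ligand second, which could then be added to the same structure.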
If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing.
My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use Biopython to parse the PDB output.
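For reference, a cleanup script of the kind mentioned above might look roughly like this. It is a sketch under simple assumptions: the function name @clean_concatenated_pdb@ is made up, all @END@, @CONECT@ and @MASTER@ records are dropped (so connectivity information is lost), and clashing chain IDs or atom serial numbers between receptor and ligand are not handled.

```python
def clean_concatenated_pdb(lines):
    """Remove trailer records from a concatenated PDB and end it once.

    Drops every END, CONECT and MASTER record wherever it appears,
    keeps all other lines in order, and appends a single END record.
    """
    trailer_records = {"END", "CONECT", "MASTER"}
    kept = [l for l in lines if l[:6].strip() not in trailer_records]
    kept.append("END\n")
    return kept
```

The cleaned lines can be written to a new file and passed to PDBParser, which then sees a single coordinate section followed by one @END@.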