[Bioperl-l] Parsing PDB entries in BioPerl

Jonathan Epstein Jonathan_Epstein@nih.gov
Tue, 13 Nov 2001 14:57:18 -0500


Kris,

I suggest that you ponder NCBI's MMDB:
   http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml

In particular, you may want to consider the following:
(a) I don't know how well MMDB is currently maintained, or how many data-consistency issues still exist in PDB, but in principle MMDB includes a cleaned-up version of PDB.
(b) There is a C API in the NCBI toolkit to access this data.  If the API is of interest to you, maybe we could work together on using it via SWIG (something that I haven't looked at for a few months)
(c) There are ASN.1 structures contained in biostruc/mmdb[123].asn in the NCBI toolkit, which itself is available at:
   ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
These are probably worth studying.  Unfortunately, I don't see any XML equivalents, which I assume you would prefer to parse.  Either way, parsing ASN.1 or XML is much safer and more stable than parsing a flat-file format.
(d) Some of the code (objmmdb*.c, as I recall) has the effect of "gutting" portions of the ASN.1, so that only a small subset of the data is sent over the net.  Even if you choose not to use the "gutting" functionality, you may wish to study it for hints about which data you might often want to exclude from your data models.
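
To illustrate the flat-file point in (c), here is a minimal sketch, in Python rather than Perl and purely for illustration. It assumes the fixed column layout of the PDB ATOM record (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-38, 39-46 and 47-54). The names Atom and parse_atom_record and the sample coordinates are made up for the example; none of this is MMDB or BioPerl code.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    """Hypothetical container for a few fields of one ATOM record."""
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_record(line: str) -> Atom:
    """Parse one ATOM/HETATM line by column position, not by whitespace."""
    if line[0:6].strip() not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Made-up sample lines. In the second one the x and y fields touch,
# so splitting on whitespace would merge them into a single token.
NORMAL = "ATOM      1  N   MET A   1      27.340  24.430   2.614"
CROWDED = "ATOM      2  CA  MET A   1    -100.123-200.456   2.614"

normal_atom = parse_atom_record(NORMAL)
crowded_atom = parse_atom_record(CROWDED)
```

A whitespace split of the second line yields one token fewer than the first, which is the sort of silent breakage a typed format such as ASN.1 or XML avoids.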

Sorry that I don't have any ideas on how to incorporate this into BioPerl's data model.

HTH,

-Jonathan

At 01:10 PM 11/13/2001 , Kris Boulez wrote:
>As I found myself writing ad-hoc scripts to get certain data out of a
>PDB entry, I've decided to write a PDB parser for BioPerl. 
>
>The idea is to parse every line in the entry and to have access to all
>the data via some Bio:: object. The work on the SeqIO parser (Bio::SeqIO::pdb)
>is progressing nicely.
>
>For the moment I'm working on parsing all the different 'records' (PDBspeak
>for different lines) and not so much on how to store the info in a Bio:: 
>object (references are already stored in Bio::Annotation::Reference objects).
>The moment to start thinking about 'how' to store 'what' inside 'which'
>Bio::* object has arrived. 
>
>My first thought was to inherit from a Bio::Seq object, but this does
>not seem to be the right approach
>   - which sequence to store (the one from Swiss-Prot?)
>   - not every residue has coordinates (e.g. at the C- and N-termini)
>   - PDB entries can consist of multiple 'chains' (i.e. a complex of two
>     proteins)
>   - how to handle post-translational modifications
>   - there is no easy access to the data that makes PDB special (x,y,z
>     coordinates, ...)
>   - how to handle 'models' (structures determined by NMR consist not
>     of one, but of multiple entries).
>
>This suggests that a new type of object might be needed. To start
>thinking about this, it might be good to consider how the user
>might use this object (i.e. 'which questions would you ask?'). I
>would therefore like to ask which data in a PDB entry you're
>typically interested in and which questions you would want to ask of
>such an object.
>
>
>
>Kris,
>-- 
>Kris Boulez                             Tel: +32-9-241.11.00
>AlgoNomics NV                           Fax: +32-9-241.11.02
>Technologiepark 4                       email: kris.boulez@algonomics.com
>B 9052 Zwijnaarde                       http://www.algonomics.com/
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l@bioperl.org
>http://bioperl.org/mailman/listinfo/bioperl-l
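
The concerns listed in the quoted message (several chains per entry, residues without coordinates, several NMR models) point toward a hierarchy rather than a single sequence object. What follows is only a rough sketch of one such hierarchy, written in Python for illustration. Every class and field name here is hypothetical, "XXXX" is a placeholder entry ID, and this is not a proposal for the actual Bio::* design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


@dataclass
class Residue:
    name: str       # three-letter residue name
    seq_num: int    # residue number within the chain
    # Atom name -> (x, y, z); left empty when no coordinates were determined.
    atoms: Dict[str, Coord] = field(default_factory=dict)

    def has_coordinates(self) -> bool:
        return bool(self.atoms)


@dataclass
class Chain:
    chain_id: str
    residues: List[Residue] = field(default_factory=list)

    def sequence(self) -> List[str]:
        """All residues in the chain, with or without coordinates."""
        return [r.name for r in self.residues]

    def observed(self) -> List[Residue]:
        """Only the residues that have at least one atom with coordinates."""
        return [r for r in self.residues if r.has_coordinates()]


@dataclass
class Model:
    number: int
    chains: Dict[str, Chain] = field(default_factory=dict)


@dataclass
class Structure:
    entry_id: str
    # One model for a typical crystal structure, several for an NMR ensemble.
    models: List[Model] = field(default_factory=list)


# Made-up example: residue 1 has no coordinates, residue 2 has one atom.
chain_a = Chain("A", [
    Residue("MET", 1),
    Residue("ALA", 2, {"CA": (1.0, 2.0, 3.0)}),
])
structure = Structure("XXXX", [Model(1, {"A": chain_a})])
```

Keeping the full residue list separate from the observed subset lets a user ask both "what is the sequence?" and "which residues have coordinates?" of the same object.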



Jonathan Epstein                                Jonathan_Epstein@nih.gov
Head, Unit on Biologic Computation              (301)402-4563
Office of the Scientific Director               Bldg 31, Room 2A47
Nat. Inst. of Child Health & Human Development  31 Center Drive
National Institutes of Health                   Bethesda, MD 20892