[Biopython] Reading large files, Biopython cookbook example

Sampson, Jared Jared.Sampson at nyumc.org
Tue Aug 6 20:10:25 UTC 2013


For the curious, there has been a conversation on the CCP4 Bulletin Board over the past few days addressing exactly this topic.  The takeaway message is essentially what Andrew has mentioned: the PDB format is here to stay for the foreseeable future.

http://www.mail-archive.com/ccp4bb@jiscmail.ac.uk/msg32321.html

Cheers,
Jared

--
Jared Sampson
Xiangpeng Kong Lab
NYU Langone Medical Center
Old Public Health Building, Room 610
341 East 25th Street
New York, NY 10016
212-263-7898
http://kong.med.nyu.edu/




On Aug 6, 2013, at 2:49 PM, Andrew Dalke <dalke at dalkescientific.com> wrote:

On Aug 6, 2013, at 11:35 AM, Peter Cock wrote:
In the long run this problem should go away as the PDB moves
to using the PDBx/mmCIF format:
http://www.wwpdb.org/news/news_2013.html#22-May-2013

Either you are optimistic or an ultramarathon runner! The
move over to mmCIF started of course 20 years ago, and that
link you gave said the change applies only to very large
structures:

   Structures that do not exceed the limitations of the PDB
   format will continue to be provided as PDB files in the
   archive for the foreseeable future.

Even for large files, which previously would split the structure
over multiple records, there will be a "best-effort" PDB format,
available as a web service.


40 years of the PDB format => well-entrenched => not going to
get rid of it any time soon.



For another historical side-note, the PDB format started in
the early 1970s, but contains a kernel which is even older!
Quoting from

 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2143743/pdf/9232661.pdf :

 In order to establish the PDB, acceptance by the crystallographic
 community was necessary, requiring a pilgrimage in 1970 to the Medical
 Research Council (MRC) laboratory and Crystal Data Centre (CDC) in
 Cambridge. One result of this exchange was a concession that coordinates
 of protein structures would be stored in the same format as the small
 molecule CDC database (with a redundant ATOM label at the beginning of
 each card), retaining the now-arcane counting number at the end. But the
 idea of a PDB was accepted by Professors Perutz, Blow, Kennard, Diamond,
 and colleagues in Cambridge.

The "now-arcane" counting number has long disappeared from the
spec. It was there, I believe, so that if the punch cards were
dropped then they could be re-sorted based on the last few columns.
(I imagine you could also write a program to strip out the
C-alpha cards, work with them, then merge the C-alphas back into
the card deck correctly.)
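
As a rough illustration of that idea, here is a minimal Python sketch.
The card layout is an assumption made for the example, not taken from
the old spec: each card is 80 columns wide with the counting number in
the last 4 columns, and the atom name sits in columns 13-16 as in the
current ATOM record. The helper names (make_card, card_number,
split_deck, merge_decks) are hypothetical.

```python
def make_card(serial, name, number):
    """Build an 80-column ATOM card (hypothetical layout: counting number in the last 4 columns)."""
    text = f"ATOM  {serial:>5}  {name:<3} MET A   1"
    return f"{text:<76}{number:>4}"


def card_number(card, width=4):
    """Return the counting number stored in the last `width` columns."""
    return int(card[-width:])


def is_calpha(card):
    """True for ATOM cards whose atom name field (columns 13-16) is CA."""
    return card.startswith("ATOM") and card[12:16].strip() == "CA"


def split_deck(deck, is_selected):
    """Pull the selected cards out of the deck; return (selected, rest)."""
    selected = [card for card in deck if is_selected(card)]
    rest = [card for card in deck if not is_selected(card)]
    return selected, rest


def merge_decks(*decks):
    """Merge any number of decks back into counting-number order."""
    return sorted((card for deck in decks for card in deck), key=card_number)


deck = [make_card(i, name, i) for i, name in enumerate(["N", "CA", "C", "O"], start=1)]

# "Drop" the deck, then recover the original order from the counting numbers.
dropped = list(reversed(deck))
recovered = merge_decks(dropped)

# Strip out the C-alpha cards, then merge them back in the right place.
calphas, rest = split_deck(deck, is_calpha)
merged = merge_decks(rest, calphas)
```

Both recovered and merged end up in the same order as the original
deck, because the sort key depends only on the trailing counting
number and not on where a card sat when it was picked up.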

Andrew
dalke at dalkescientific.com


_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython




