[Biopython-dev] [GSOC] Report - Week 1
João Rodrigues
anaryin at gmail.com
Mon Jun 7 23:42:27 EDT 2010
Just my own heads up and comment.
I thought of using MODEL records to hold the rotated structures. Citing the
PDB format guidelines:
This record is used only when more than one model appears in an entry.
*Generally,
> it is employed mainly for NMR structures.* The chemical connectivity
> should be the same for each model. ATOM, HETATM, ANISOU, and TER records for
> each model structure and are interspersed as needed between MODEL and ENDMDL
> records.
>
Since REMARK 350 seems to be a X-Ray exclusive feature and conversely MODEL
a NMR one, I believe this could also be a possible solution. I'm adding the
code I wrote to Git. There is a huge speed problem with that deepcopy
method.. if someone has a faster/better alternative, it would be great as
this takes around 2 seconds per matrix.
Best!
João [...] Rodrigues
@ http://www.biopython.org/wiki/User:Joaor
On Mon, Jun 7, 2010 at 7:45 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Dear all,
>
> Eric suggested me to write a weekly email wrapping up my progress, any
> problems I encountered, new ideas, etc. So, here's week 1 :)
>
> *Proposed Tasks:* Wiki<http://www.biopython.org/wiki/GSOC2010_Joao#Week_1_.5B31st_May_-_6th_June.5D>
> *Project's Github account:* Link<http://github.com/JoaoRodrigues/biopython/tree/GSOC2010>
> *
> Progress:*
>
> *1. Renumbering Residues*
>
> I wrote a small function in Structure.py (link<http://github.com/JoaoRodrigues/biopython/blob/GSOC2010/Bio/PDB/Structure.py#L57>)
> that iterates over the residues in a chain and subtracts the original first
> residue number. This keeps gaps intacts. Worked on my machine for a set of
> 75 proteins I was working on. Also allows for people to change the starting
> residue for whatever reason, the default being 1.
>
> I had originally thought of having a SEQREQ parsing function and using this
> as a base for the new renumbering. However, most structures that lack
> residues (gaps) still count them in the numbering. Since there is no parser
> for SEQRES, I thought this to be the best option.
>
> *Example
> *
> ...
> s = p.get_structure('a', '2KSX.pdb')
> s.renumber_residues()
> s.renumber_residues(start=0)
>
>
> *2. Disulphide bond search*
>
> I originally proposed to use the NeighborSearch method but I didn't know
> that subtracting two atom objects gave me their distance. I used this
> instead.
>
> I defined a threshold of 3A for a S-S since the average is 2.05A. I tried
> to get some paper/doc from other software where such a limit would be
> already defined but I didn't find any.. thus, I assigned 3 because its
> results agreed with the SSBOND records. The user can provide a threshold
> integer or float as an argument to make the search stricter or broader.
>
> The function generates first an iterator with all the pairs of cysteines
> possible in the protein. It then checks and yields those with distances
> between the SG atoms of the cystein below the threshold. The result is also
> an iterator with tuples containing pairs of residue objects.
>
> *Example*
>
> ...
> s = p.get_structure('a', '2KSX.pdb')
> [i for i in s.search_ss_bonds()]
> [(<Residue CYS het= resseq=1 icode= >, <Residue CYS het= resseq=98 icode=
> >), (<Residue CYS het= resseq=29 icode= >, <Residue CYS het= resseq=138
> icode= >), (<Residue CYS het= resseq=12 icode= >, <Residue CYS het=
> resseq=95 icode= >), (<Residue CYS het= resseq=58 icode= >, <Residue CYS
> het= resseq=66 icode= >), (<Residue CYS het= resseq=180 icode= >, <Residue
> CYS het= resseq=200 icode= >)]
> len([i for i in s.search_ss_bonds(threshold=100)])
> 45
>
>
>
> *Problems:*
>
> *3. Biological Unit*
>
> I added code to parse_pdb_header to extract the REMARK 350 section. They
> contain something like this (1IHM.pdb<http://www.pdb.org/pdb/files/1IHM.pdb>
> ):
>
> REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000
> 0.00000
> REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000
> 0.00000
> REMARK 350 BIOMT3 1 0.000000 0.000000 1.000000
> 0.00000
> REMARK 350 BIOMT1 2 0.500000 -0.809017 -0.309017
> 0.00000
> REMARK 350 BIOMT2 2 0.809017 0.309017 0.500000
> 0.00000
> REMARK 350 BIOMT3 2 -0.309017 -0.500000 0.809017
> 0.00000
> REMARK 350 BIOMT1 3 -0.309017 -0.500000 -0.809017 0.00000
>
> I parse out the 4th column to identify each transformation. I store a 3x3
> rotation matrix and the translation vector separately. It is then easy to
> apply them to each atom record via the transform function.
>
> Now, the problem lies in what the output should be. We broke it down to two
> main options:
>
> a. Create a new structure object for each rotated/translated object, thus
> making the final output a list of structures. This takes quite a while
> actually. I tried this with a deepcopy method to copy each structure and it
> took over 30 seconds on my machine for that PDB file above.
>
> b. Add the new rotated objects as new chains in the original structure.
> This is actually a good solution because it allows people to use other
> methods (the SS search comes to mind) on quartenary structures. It also
> allows the user to write a file with all the structures in their place using
> PDBIO quite seamlessly. However, it might be complicated to deal with an
> excess of chains, or if not all chains are supposed to be rotated (dunno if
> the case actually exists).
>
> My personal belief is that B is the way to go. Although it adulterates the
> original structure with alien chains, it allows much greater flexibility. I
> haven't tested it though.
>
> ----
>
> Comments? :)
>
> João [...] Rodrigues
>
>
More information about the Biopython-dev
mailing list