[Biopython-dev] [GSOC] Report - Week 1

Mon Jun 7 20:45:05 EDT 2010

Dear all,

Eric suggested me to write a weekly email wrapping up my progress, any
problems I encountered, new ideas, etc. So, here's week 1 :)

*Proposed Tasks:*
Wiki<http://www.biopython.org/wiki/GSOC2010_Joao#Week_1_.5B31st_May_-_6th_June.5D>
*Project's Github account:*
Link<http://github.com/JoaoRodrigues/biopython/tree/GSOC2010>
*
Progress:*

*1. Renumbering Residues*

I wrote a small function in Structure.py
(link<http://github.com/JoaoRodrigues/biopython/blob/GSOC2010/Bio/PDB/Structure.py#L57>)
that iterates over the residues in a chain and subtracts the original first
residue number. This keeps gaps intacts. Worked on my machine for a set of
75 proteins I was working on. Also allows for people to change the starting
residue for whatever reason, the default being 1.

I had originally thought of having a SEQREQ parsing function and using this
as a base for the new renumbering. However, most structures that lack
residues (gaps) still count them in the numbering. Since there is no parser
for SEQRES, I thought this to be the best option.

*Example
*
...
s = p.get_structure('a', '2KSX.pdb')
s.renumber_residues()
s.renumber_residues(start=0)

*2. Disulphide bond search*

I originally proposed to use the NeighborSearch method but I didn't know
that subtracting two atom objects gave me their distance. I used this
instead.

I defined a threshold of 3A for a S-S since the average is 2.05A. I tried to
get some paper/doc from other software where such a limit would be already
defined but I didn't find any.. thus, I assigned 3 because its results
agreed with the SSBOND records. The user can provide a threshold integer or
float as an argument to make the search stricter or broader.

The function generates first an iterator with all the pairs of cysteines
possible in the protein. It then checks and yields those with distances
between the SG atoms of the cystein below the threshold. The result is also
an iterator with tuples containing pairs of residue objects.

*Example*

...
s = p.get_structure('a', '2KSX.pdb')
[i for i in s.search_ss_bonds()]
[(<Residue CYS het=  resseq=1 icode= >, <Residue CYS het=  resseq=98 icode=
>), (<Residue CYS het=  resseq=29 icode= >, <Residue CYS het=  resseq=138
icode= >), (<Residue CYS het=  resseq=12 icode= >, <Residue CYS het=
resseq=95 icode= >), (<Residue CYS het=  resseq=58 icode= >, <Residue CYS
het=  resseq=66 icode= >), (<Residue CYS het=  resseq=180 icode= >, <Residue
CYS het=  resseq=200 icode= >)]
len([i for i in s.search_ss_bonds(threshold=100)])
45

*Problems:*

*3. Biological Unit*

I added code to parse_pdb_header to extract the REMARK 350 section. They
contain something like this (1IHM.pdb<http://www.pdb.org/pdb/files/1IHM.pdb>
):

REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000
0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000
0.00000
REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000
0.00000
REMARK 350   BIOMT1   2  0.500000 -0.809017 -0.309017
0.00000
REMARK 350   BIOMT2   2  0.809017  0.309017  0.500000
0.00000
REMARK 350   BIOMT3   2 -0.309017 -0.500000  0.809017
0.00000
REMARK 350   BIOMT1   3 -0.309017 -0.500000 -0.809017        0.00000

I parse out the 4th column to identify each transformation. I store a 3x3
rotation matrix and the translation vector separately. It is then easy to
apply them to each atom record via the transform function.

Now, the problem lies in what the output should be. We broke it down to two
main options:

a. Create a new structure object for each rotated/translated object, thus
making the final output a list of structures. This takes quite a while
actually. I tried this with a deepcopy method to copy each structure and it
took over 30 seconds on my machine for that PDB file above.

b. Add the new rotated objects as new chains in the original structure. This
is actually a good solution because it allows people to use other methods
(the SS search comes to mind) on quartenary structures. It also allows the
user to write a file with all the structures in their place using PDBIO
quite seamlessly. However, it might be complicated to deal with an excess of
chains, or if not all chains are supposed to be rotated (dunno if the case
actually exists).

My personal belief is that B is the way to go. Although it adulterates the
original structure with alien chains, it allows much greater flexibility. I
haven't tested it though.

----

Comments? :)

João [...] Rodrigues