[Biopython] About Google Summer Code Project PDB-tidy

Fuxiao Xin fuxin at indiana.edu
Thu Apr 8 07:40:36 UTC 2010


Hi Eric and Diana,

Thanks for your quick reply.

For the quality/validation problem, thank you, Diana, for pointing me to the
two resources. I am surprised that so many "problems" are defined for PDB
files; I obviously underestimated this task. I think it is a very
interesting problem to study, and I would like to devote more time to it.
I am thinking of making it the main focus of my first coding period
(before the midterm check). What do you think?

For Eric's responses, please find my replies inline.

>> My own research needs extensive manipulation of PDB files, and I think this
>> idea of adding more features to Bio.PDB and more command line options to
>> analyze/present PDB data is excellent. This project is of strong interest
>> to me since it will benefit my own research project as well.
>>
>
> Good to hear. Does your lab have a website? This project requires some
> knowledge of structural biology, so it helps if we can see what specific
> research you've already done in that area.
>

Our lab's website is http://www.informatics.indiana.edu/predrag/ . One
main focus of our lab is PTM and disorder, both of which require working with
PDB files. My protein structure-based kernel work appears as a poster title at
http://www.iscb.org/rocky09-program/rocky09-poster-presenters-abstracts ,
but they didn't put the abstract online. I could send you the abstract if you
are interested.


>> Programming Skills: I use Perl and Python in my daily research. I am now
>> working on developing a new functional site predictor using protein
>> structure information. The code will be open source, but the work is under
>> review so the code is not released yet.
>>
>
> Is there any other programming work you've done in the past that you could
> let us see? It doesn't have to be part of an existing open-source project;
> even some functioning snippets posted somewhere would help us get a sense of
> your coding style and abilities. Examples where you've used Biopython or
> another established toolkit for working with PDB files or other scientific
> data would be especially useful.
>
> We also like to see that you're familiar with a project's build tools, which
> in Biopython's case is GitHub and the standard Python mechanisms. So, if you
> could upload some of your prior work to GitHub and send us the link, that
> would be ideal.
>

I put some of my Python code here:
http://github.com/fuxiaoxin/my_python_code . I don't have any Python code
that uses Bio.PDB. For parsing PDB files, my code is in Perl because of its
regular expressions. I have seldom used BioPerl or Biopython in the past; I
wrote all my own code, which is also why I think I have a clear picture of
the kinds of problems found in PDB files. I am quite surprised to find that
Bio.PDB already has so many modules for various functions. I could upload
some of my Perl functions if you would like to have a look: I have functions
similar to PDBParser, NeighborSearch, DSSP, and NACCESS.

I have to say I am not very familiar with Python's build tools, but I
hope to learn them during the bonding period. I just guided myself through
uploading my code to GitHub. :)

>> My project plan:
>>
>> week1
>> 1. Renumber residues starting from 1 (or N)
>> function name: renumberPDB; given a PDB file, renumber the residues in its
>> ATOM records to close the gaps left by missing amino acids
>> communicate with mentors to set coding standards to follow for the rest
>> of the functions
>> create a work log to keep track of progress;
>>
>
> Biopython's coding standards generally follow an earlier version of PEP 8;
> hopefully you can pick it up quickly just by reading the source code for
> Bio.PDB -- so you don't really need that item listed here.
>
>
I will learn from Bio.PDB source code and remove this one.


> In the past, students have maintained their weekly schedules on a wiki or
> other public document, and updated them continually throughout the summer.
> This functions as a work log, in a way. You would also have an e-mail record
> of your work from your weekly reports to this list.
>

That's great to know.


>> week2-3
>> 2. Select a portion of the structure -- models, chains, etc. -- and write
>> it to a new file (PDB, FASTA, and other formats)
>> function name: rewritePDB; inputs will be the portion of a PDB file
>> you want to write out (supporting 'chain', 'model', 'atom'), a file
>> format (PDB, FASTA), and the output name.
>> 3. Perform some basic, well-established measures of model quality/validity
>> function name: PDBquality
>> the function will report RESOLUTION and ? of the structure
>> 4. Extract disordered regions in a PDB structure
>> function name: PDBdisorder
>> report residues missing from the structure's ATOM records
>>
>
> These tasks seem reasonable. You don't need to commit to specific function
> names yet; it would be more helpful to describe the overall module layout
> you're planning, and list the dependencies for each (especially the
> components of Bio.PDB that come into play).
>

I will write a new proposal with these details by tomorrow.
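
A minimal sketch of how the missing-residue report in item 4 might work,
using plain Python on the fixed-column ATOM records rather than Bio.PDB. The
function name find_numbering_gaps is hypothetical, and the sketch assumes that
a jump in residue numbers within a chain marks missing residues. That
assumption is rough: insertion codes are ignored, some files have deliberate
numbering jumps, and the REMARK 465 records, where present, are a more direct
source.

```python
def find_numbering_gaps(pdb_lines):
    """Return {chain_id: [(first_missing, last_missing), ...]}.

    Hypothetical helper: infers gaps only from residue numbers in ATOM
    records, so it is an approximation of "missing residues".
    """
    seen = {}  # chain ID -> residue numbers in order of appearance
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]           # column 22: chain identifier
        resseq = int(line[22:26])  # columns 23-26: residue sequence number
        numbers = seen.setdefault(chain, [])
        # Each residue has many atoms; record its number only once.
        if not numbers or numbers[-1] != resseq:
            numbers.append(resseq)

    gaps = {}
    for chain, numbers in seen.items():
        for prev, cur in zip(numbers, numbers[1:]):
            if cur - prev > 1:
                gaps.setdefault(chain, []).append((prev + 1, cur - 1))
    return gaps
```

Returning a dictionary keeps the result usable in batch scripts; a
Bio.PDB-based version would presumably walk Chain and Residue objects instead
of slicing columns.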


>
>> week3-4
>> 5. Make a function to draw a Ramachandran plot
>> function name: ramaPLOT
>> combine the two steps (calculating torsion angles and drawing the plot) into
>> one function, with an option to draw the plot or not
>>
>
> This task has a number of dependencies which I think you should list and
> describe here. Because of those dependencies there's a significant chance of
> it taking longer than you planned -- so I'd recommend moving it to after the
> midterm evaluations, wherever those fit into your schedule.
>

I will add more details here.
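
For the torsion-angle step, the underlying calculation can be sketched in
plain Python as below. The function name dihedral is hypothetical and the
points are plain (x, y, z) tuples; in the real function the angles would more
likely come from Bio.PDB's own calc_dihedral on atom vectors, so this is only
an illustration of the geometry, not a proposed implementation.

```python
import math


def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle in degrees defined by four 3D points.

    Hypothetical helper illustrating the standard atan2 formulation.
    """
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b1 = sub(p2, p1)   # bond vectors along the chain of four points
    b2 = sub(p3, p2)
    b3 = sub(p4, p3)
    n1 = cross(b1, b2)  # normal of the plane through p1, p2, p3
    n2 = cross(b2, b3)  # normal of the plane through p2, p3, p4

    x = dot(n1, n2)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    return math.degrees(math.atan2(y, x))
```

For phi the four points would be C(i-1), N(i), CA(i), C(i), and for psi
N(i), CA(i), C(i), N(i+1); the plotting part is a separate dependency.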


>> week5
>> 6. Open PDB files in a window for visualization, visualize PDBsuperpose
>> results, output RMSD
>> function name: superposePDB
>> the function will look like the PDBsuperpose function in MATLAB; use
>> Bio.PDB.Superimposer() to perform the superposition, and use Jmol or another
>> visualization tool to see the results
>>
>
> Would you build Python wrappers for interacting with the chosen
> visualization tool, or just write a set of files and launch the viewer in a
> script?
>

I am thinking of launching the viewer from a script, since those PDB
visualization tools already have very nice command line options and
interfaces. But I think it is really important to be able to visualize the
structure on the fly, especially when superimposing PDB structures.


>> week6
>> 7. Write a function to extract all experimental conditions of a PDB file,
>> including pH, temperature, and salt
>> function name: PDBcondition
>> it will be easy to get pH and temperature information, but salt will
>> be hard to parse because there is no general rule for such information in
>> the PDB file; parse the REMARK 200 field;
>>
>
> Sounds handy. Would your script write out a report combining all of this
> info, or just extract requested elements?
>

I am thinking of putting the results into a variable instead of a report,
since that works well for batch processing and also displays the results
immediately in interactive mode.
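
A minimal sketch of the "results in a variable" idea, in plain Python. The
function name parse_remark_200 and the dictionary keys are hypothetical, and
the sketch assumes REMARK 200 lines use the "LABEL : VALUE" layout with the
labels TEMPERATURE (KELVIN) and PH. Values that are not a single number
(NULL, ranges, several values) are returned as None instead of being guessed.
Salt is left out because, as noted above, there is no general rule for it.

```python
def parse_remark_200(pdb_lines):
    """Return a dict with temperature (K) and pH taken from REMARK 200.

    Hypothetical helper; assumes 'REMARK 200  LABEL : VALUE' lines.
    """
    def to_float(text):
        # Non-numeric entries such as 'NULL' or '4.5-5.0' become None.
        try:
            return float(text)
        except ValueError:
            return None

    conditions = {"temperature_kelvin": None, "ph": None}
    for line in pdb_lines:
        if not line.startswith("REMARK 200") or ":" not in line:
            continue
        label, _, value = line[10:].partition(":")
        label = " ".join(label.split())  # collapse the column padding
        if label == "TEMPERATURE (KELVIN)":
            conditions["temperature_kelvin"] = to_float(value.strip())
        elif label == "PH":
            conditions["ph"] = to_float(value.strip())
    return conditions
```

In interactive mode the returned dictionary prints directly, and in a batch
run the same function can be called once per file and the dictionaries
collected in a list.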

>
>> Other obligations: I am aware that Google Summer of Code starts on May
>> 24th, but I have a review paper with my advisor due on June 1st. I hope it
>> will be OK for me to start after June 1st, and I will make up the first
>> week in August.
>>
>
> How much of the "community bonding period" will this occupy? The guideline
> is that you get set up with the build system, read documentation and do
> background research part-time between GSoC acceptance and May 24, and start
> writing code full-time on May 24. You can make up for a gap in your project
> plan by doing extra preparation before coding starts; would this be possible
> for you?
>

I think the bonding period will be really important for me to get to know
the Python build tools, and of course anything else you mentors suggest
I learn, so I will devote my time to "bonding". But since I will get
busy near the end of May, I plan to start early and work more
efficiently.


>
> Finally, the GSoC administration app (socghop.appspot.com) gets crowded as
> the deadline approaches, so it's best if you register yourself there and
> take care of the administrivia as soon as you can to avoid any trouble on
> Friday.
>

Thanks for the reminder. I will incorporate your and Diana's suggestions
into a new version of the proposal by tomorrow night. The idea is that the
main project for the first period will be the quality/validation task, and
the main project for the second period will be the Ramachandran plot. I will
fill in the remaining time with other small functions.


Thanks,
Fuxiao


