[Biopython] About Google Summer Code Project PDB-tidy

Eric Talevich eric.talevich at gmail.com
Thu Apr 8 03:48:08 UTC 2010


Hi Fuxiao,

Thanks for your interest in this project. I see you've been working on this
proposal for a while already, so although the submission deadline is very
close, I think you'll still be OK. I've interleaved my comments with your
proposal below:

On Wed, Apr 7, 2010 at 9:40 PM, Fuxiao Xin <fuxin at umail.iu.edu> wrote:

> Dear all,
>
> I am a third-year PhD student in Bioinformatics at Indiana University
> Bloomington. I am very interested in the Biopython Google Summer of Code
> project "PDB-Tidy: command-line tools for manipulating PDB files".
>
> My own research needs extensive manipulation of PDB files, and I think this
> idea of adding more features to Bio.PDB and more command-line options to
> analyze/present PDB data is excellent. This project is of strong interest to
> me since it will benefit my own research project as well.
>

Good to hear. Does your lab have a website? This project requires some
knowledge of structural biology, so it helps if we can see what specific
research you've already done in that area.

> Programming Skills: I use Perl and Python in my daily research. I am now
> working on developing a new functional site predictor using protein
> structure information. The code will be open source, but the work is under
> review so the code is not released yet.
>

Is there any other programming work you've done in the past that you could
let us see? It doesn't have to be part of an existing open-source project;
even some functioning snippets posted somewhere would help us get a sense of
your coding style and abilities. Examples where you've used Biopython or
another established toolkit for working with PDB files or other scientific
data would be especially useful.

We also like to see that you're familiar with a project's build tools, which
in Biopython's case are GitHub and the standard Python mechanisms. So, if you
could upload some of your prior work to GitHub and send us the link, that
would be ideal.


> My project plan:
>
> week1
> 1. Renumber residues starting from 1 (or N)
> function name: renumberPDB; given a PDB file, renumber the residues in the
> ATOM records to close the gaps left by missing amino acids
> communicate with mentors to set coding standards to follow for the rest of
> the functions
> create a work log to keep track of progress;
>

Biopython's coding standards generally follow an earlier version of PEP 8;
hopefully you can pick it up quickly just by reading the source code for
Bio.PDB -- so you don't really need that item listed here.

In the past, students have maintained their weekly schedules on a wiki or
other public document, and updated them continually throughout the summer.
This functions as a work log, in a way. You would also have an e-mail record
of your work from your weekly reports to this list.
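
On the renumbering item itself: the core of it is small. Here's a rough
sketch of how I understand the task, working directly on the fixed-width
ATOM records rather than through Bio.PDB. The function name and the choice
to restart the numbering for each chain are my own assumptions for
illustration, not an existing Biopython API:

```python
def renumber_residues(pdb_lines, start=1):
    """Renumber residues consecutively within each chain.

    Hypothetical helper: only ATOM records are rewritten; every other
    line is passed through unchanged.
    """
    out = []
    current_chain = None
    current_key = None      # (chain, original number, insertion code)
    number = start - 1
    for line in pdb_lines:
        if line.startswith("ATOM"):
            chain = line[21]                      # column 22: chain ID
            key = (chain, line[22:26], line[26])  # columns 23-27
            if chain != current_chain:
                # New chain: restart the counter
                current_chain = chain
                number = start - 1
            if key != current_key:
                # First atom of a new residue
                current_key = key
                number += 1
            # Write the new number and blank out the insertion code
            line = line[:22] + "%4d" % number + " " + line[27:]
        out.append(line)
    return out
```

A real tool would also have to decide what to do with HETATM and TER
records, and with anything else that refers to residue numbers (HELIX,
SHEET, SSBOND and so on), which this sketch ignores.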

> week2-3
> 2. Select a portion of the structure -- models, chains, etc. -- and write it
> to a new file (PDB, FASTA, and other formats)
> function name: rewritePDB; inputs will be the portion of a PDB file you want
> to write out (support 'chain', 'model', 'atom'), a file format (PDB, FASTA),
> and the output name.
> 3. Perform some basic, well-established measures of model quality/validity
> function name: PDBquality
> the function will report RESOLUTION and ? of the structure
> 4. Extract disordered regions in a PDB structure
> function name: PDBdisorder
> report residues missing from the ATOM records of the structure
>

These tasks seem reasonable. You don't need to commit to specific function
names yet; it would be more helpful to describe the overall module layout
you're planning, and list the dependencies for each (especially the
components of Bio.PDB that come into play).
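
For item 4, as one illustration of the kind of thing I mean: a first pass
at "missing residues" could simply look for jumps in the residue numbering
of the ATOM records. This is only a sketch with a made-up function name,
and it assumes the authors numbered the residues consecutively, which isn't
always true:

```python
def find_numbering_gaps(pdb_lines):
    """Return (chain, first_missing, last_missing) for each numbering gap.

    Hypothetical helper: looks only at ATOM records and treats any jump
    of more than one in the residue number as a gap.
    """
    gaps = []
    last_seen = {}  # chain ID -> last residue number seen in that chain
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]            # column 22: chain ID
        resseq = int(line[22:26])   # columns 23-26: residue number
        prev = last_seen.get(chain)
        if prev is not None and resseq > prev + 1:
            gaps.append((chain, prev + 1, resseq - 1))
        last_seen[chain] = resseq
    return gaps
```

Entries that carry a REMARK 465 section list the unobserved residues
explicitly, so a more robust version would probably cross-check against
that.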


> week3-4
> 5. make a function to draw a Ramachandran plot
> function name: ramaPLOT
> combine the two steps (calculating torsion angles and drawing the plot) into
> one function, with an option to draw the plot or not
>

This task has a number of dependencies which I think you should list and
describe here. Because of those dependencies there's a significant chance of
it taking longer than you planned -- so I'd recommend moving it to after the
midterm evaluations, wherever those fit into your schedule.
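
For what it's worth, the torsion-angle half of this needs nothing beyond
the atom coordinates. Bio.PDB already has calc_dihedral for its Vector
objects, so you wouldn't write this yourself; the standalone sketch below
(helper names are mine) is just to show how little is involved compared to
the plotting side:

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four points (x, y, z)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)   # the central bond, about which we measure
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to b1
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

For residue i, phi would be dihedral(C of residue i-1, N, CA, C) and psi
would be dihedral(N, CA, C, N of residue i+1).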

> week5
> 6. open PDB files in a window for visualization, visualize PDBsuperpose
> results, output RMSD
> function name: superposePDB
> the function will look like the PDBsuperpose function in MATLAB; use
> Bio.PDB.Superimposer() to perform the superimposition, use Jmol or another
> visualization tool to see the results
>

Would you build Python wrappers for interacting with the chosen
visualization tool, or just write a set of files and launch the viewer in a
script?


> week6
> 7. write a function to extract all experimental conditions of a PDB file,
> including pH, temperature, and salt
> function name: PDBconditon
> it will be easy to get pH and temperature information, but salt will be hard
> to parse because there is no general rule for such information in the PDB
> file; parse the REMARK 200 field;
>

Sounds handy. Would your script write out a report combining all of this
info, or just extract requested elements?
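
To give an idea of the "extract requested elements" end of that range,
here's a minimal sketch for pH and temperature. It assumes the usual
"LABEL : value" layout of REMARK 200 lines; the function name and the
dictionary keys are made up for this example:

```python
def parse_remark_200(pdb_lines):
    """Pull temperature (K) and pH out of REMARK 200 lines.

    Hypothetical helper: a value that doesn't parse as a single number
    (e.g. "NULL", or several values for several crystals) is left as None.
    """
    conditions = {"temperature_K": None, "pH": None}
    for line in pdb_lines:
        if not line.startswith("REMARK 200"):
            continue
        label, sep, value = line[10:].partition(":")
        if not sep:
            continue  # blank or free-text REMARK 200 line
        label = label.strip()
        if label.startswith("TEMPERATURE"):
            key = "temperature_K"
        elif label == "PH":
            key = "pH"
        else:
            continue
        try:
            conditions[key] = float(value.strip())
        except ValueError:
            pass  # not a plain number; keep None
    return conditions
```

A report-style script could then just format this dictionary, along with
whatever you manage to recover for the salt.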


> week7-8
> 8. extract PTM,
> function name: PDBptm
> difficulty: the post-translational modification annotation in PDB is not
> consistent; need to make a list of PTMs to work on
> parse the MODRES field
>
> week9-10
> 9. extract ligand binding information
> function name: PDBligand
> parse HETNAM field
>

Good. Some of these later items sound straightforward enough that it would
be better to tackle them earlier in the summer.
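
As an example of why I'd call them straightforward: MODRES and HETNAM are
fixed-width records, so a first version is mostly column slicing. The
sketch below uses the column positions as I read them in the PDB format
description; the function names and the returned structures are my own
invention, so treat it as a starting point only:

```python
def parse_modres(pdb_lines):
    """Collect modified residues from MODRES records (hypothetical helper)."""
    records = []
    for line in pdb_lines:
        if line.startswith("MODRES"):
            records.append({
                "resname": line[12:15].strip(),      # modified residue name
                "chain": line[16],
                "resseq": int(line[18:22]),
                "std_resname": line[24:27].strip(),  # standard parent residue
                "comment": line[29:70].strip(),      # free-text description
            })
    return records


def parse_hetnam(pdb_lines):
    """Map each het ID to its chemical name from HETNAM records.

    Continuation lines for the same het ID are joined with a space, which
    may not be right for names that were split in the middle of a word.
    """
    names = {}
    for line in pdb_lines:
        if line.startswith("HETNAM"):
            het_id = line[11:14].strip()
            text = line[15:70].strip()
            names[het_id] = (names.get(het_id, "") + " " + text).strip()
    return names
```

The harder part, as you say, is deciding which of the MODRES entries count
as genuine PTMs; the free-text comment won't settle that by itself.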


> Other obligations: I am aware that Google Summer of Code starts on May 24th,
> but I have a review paper with my advisor due on June 1st. I hope it will be
> OK for me to start after June 1st, and I will make up the first week in
> August.
>

How much of the "community bonding period" will this occupy? The guideline
is that you get set up with the build system, read documentation and do
background research part-time between GSoC acceptance and May 24, and start
writing code full-time on May 24. You can make up for a gap in your project
plan by doing extra preparation before coding starts; would this be possible
for you?

Finally, the GSoC administration app (socghop.appspot.com) gets crowded as
the deadline approaches, so it's best if you register yourself there and
take care of the administrivia as soon as you can to avoid any trouble on
Friday.

Best regards,
Eric


