[Biopython] Biopython & p3d

Wed Oct 21 10:31:38 UTC 2009

On 21 Oct 2009, at 11:18, Peter wrote:

> On Wed, Oct 21, 2009 at 8:25 AM, Christian Fufezan
> <fufezan at uni-muenster.de> wrote:
>> Hello Biopython,
>>
>> we (Michael Specht & I) recently published p3d, a Python module for
>> structural bioinformatics, and were wondering if it wouldn't be a good
>> thing if it could join the Biopython project. We understand that
>> Biopython already has a PDB parser, but we programmed an alternative
>> version since we found the Bio.PDB syntax to be too unpythonic. One
>> example why is shown below:
>>
>> Biopython:
>>
>> def test6(structure):
>>        '''get protein surrounding (5) of NAG'''
>>        bucket = set()
>>        atom_list=Selection.unfold_entities(structure,'A')
>>        ns = NeighborSearch(atom_list)
>>        for model in structure.get_list():
>>                for chain in model.get_list():
>>                        for residue in chain.get_list():
>
> I'm not very familiar with the NeighborSearch code, but
> I'm pretty sure the above for loops can be just:
>
> for model in structure:
>    for chain in model:
>        for residue in chain:
>            ...
>
> And regarding detecting oxygen atoms, I think there is
> a patch on bugzilla to record the (relatively) new atom
> column from the PDB file (which will help with Hg and
> mercury versus hydrogen).
>
> Still, I would agree with you that some parts of Bio.PDB
> are not very pythonic - too many functions named get_*()
> which could be replaced with properties. This is something
> we could evolve gradually (add new properties, keep the
> old methods in place but gradually deprecate them).
>
> Specific suggestions would be welcome.
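
To make the suggestion above concrete, here is a minimal sketch of adding a property while keeping the old accessor alive but deprecated. The Residue class below is a toy stand-in written for this example (Python 3), not the real Bio.PDB class, and the warning text is only an assumption about how such a transition might be worded.

```python
import warnings


class Residue:
    """Toy stand-in for a Bio.PDB-style entity (hypothetical, not the real class)."""

    def __init__(self, resname):
        self._resname = resname

    @property
    def resname(self):
        # New, attribute-style access.
        return self._resname

    def get_resname(self):
        # Old accessor kept for backwards compatibility, but flagged.
        warnings.warn(
            "get_resname() is deprecated; use the .resname property",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.resname


residue = Residue("NAG")
print(residue.resname)
```

Existing scripts that call get_resname() keep working and only see a DeprecationWarning, so the old method can be removed in a later release.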

That's maybe the biggest difference between Biopython and p3d, and it
will make it difficult to merge the two modules.
A data structure built like that of Bio.PDB imposes multiple nested
loops and condition queries.
p3d's data structure is not nested and gains its strength from the
combination of sets and a BSPTree.
This allows faster and more flexible looping, for example looping over
all alpha- and beta-carbons and printing their x-coordinates:

p3d:
for atom in pdb.query('protein and atom type CB or atom type CA'):
	print atom.x

Still, I think both methods could exist side by side. Whether that
would be efficient, I don't know. Replacing Biopython's PDB parser was
never the intention, and I think it has features that are really good
and fast!
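
The idea of a flat, set-based structure can be sketched in a few lines. This is only an illustration of the principle (Python 3), not p3d's actual implementation: the Atom tuple, the index dictionaries and the tiny residue list are all made up for the example.

```python
from collections import namedtuple

# Hypothetical flat atom record; real PDB atoms carry many more fields.
Atom = namedtuple("Atom", "idx name resname x y z")

atoms = [
    Atom(0, "N", "ALA", 1.0, 0.0, 0.0),
    Atom(1, "CA", "ALA", 2.0, 0.0, 0.0),
    Atom(2, "CB", "ALA", 3.0, 1.0, 0.0),
    Atom(3, "O1", "NAG", 9.0, 9.0, 9.0),
]

# One set of atom indices per property value, built in a single pass.
by_name = {}
by_resname = {}
for atom in atoms:
    by_name.setdefault(atom.name, set()).add(atom.idx)
    by_resname.setdefault(atom.resname, set()).add(atom.idx)

PROTEIN_RESNAMES = {"ALA", "GLY"}  # deliberately tiny for the example
protein = set()
for resname in PROTEIN_RESNAMES:
    protein |= by_resname.get(resname, set())

# A selection like 'protein and (CA or CB)' becomes plain set algebra.
selected = protein & (by_name.get("CA", set()) | by_name.get("CB", set()))
xs = [atoms[i].x for i in sorted(selected)]
print(xs)
```

No nested model/chain/residue loops are needed: each condition is a set lookup, and combining conditions is an intersection or union.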

>
>> def test6(pdb):
>>        ''' protein surrounding (5) of resname NAG'''
>>        bgl = pdb.query('resname NAG')
>>        bucket = pdb.query('protein and oxygen and within 5 of ',bgl)
>>        print '     found',len(bucket),' oxygens around NAG'
>>        return
>>
>> Certainly, Biopython's PDB module has its advantages and there is no
>> way p3d could replace it, but both modules have their strengths :) The
>> fact that Biopython's PDB module uses a KDTree written in C, whereas we
>> wrote ours in Python, makes certain queries to the protein structure
>> faster in Biopython; however, if the query involves more complex
>> demands, multiple loops are inevitable in Biopython, whereas p3d offers
>> a human-readable query function that combines all aspects. The link to
>> our publication is:
>> http://www.biomedcentral.com/1471-2105/10/258
>
> I remember skim-reading it a month or so ago. I remember the final line
> of the abstract was a very strong opinion ("a perfect tool"), and I was
> rather surprised the reviewers and editor let you keep it - regardless
> of any bias I might feel towards Biopython ;)
>

I guess it was a selling point ;)

>> Looking forward to hearing from you. Maybe one can also envision a
>> combined module that brings all the advantages together.
>
> That would be a good outcome.
>
> From the snippet of code and the examples in the paper, the big feature
> you have that Bio.PDB lacks is "fancy selections", and that is certainly
> something which could be improved in Biopython.
>
Yes, that was one thing that we were really missing. Also, the fact that
Biopython requires the unfolded entities to be converted to vectors and
so forth was a bit complex, and we needed fast and direct access to the
coordinates, the very essence of PDB files.

> It is interesting you have implemented (invented?) a string-based
> language with logical and, within etc. In some ways it reminds me of
> the selection formulae in VMD - have you used that 3D visualisation
> tool?
>

Yes, I use VMD a lot and the inspiration certainly came from there.
A few things are, however, unique to p3d, e.g. selecting the first
residue of chain A, and p3d supports residue 15 .. 20 to select a range
of residues.

Michael coded the parser that translates the human-readable query into
set operations and functions, and he even implemented a strategy by
which new functions or query types can be built in in no time. E.g.
"ligand containing sulfur" could be implemented in 5 min.
He has truly done a great job on this.
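
For readers who want a feel for how a readable query can be turned into set operations, here is a minimal sketch (Python 3). It is not p3d's parser: the keyword registry, the decorator, the toy atom table and the rule that 'and' binds tighter than 'or' are all assumptions made for this example.

```python
ATOMS = [
    # (index, atom name, residue name, element) - toy data
    (0, "CA", "ALA", "C"),
    (1, "O", "ALA", "O"),
    (2, "O1", "NAG", "O"),
    (3, "SG", "CYS", "S"),
]

KEYWORDS = {}  # word -> (number of arguments, function returning a set)


def keyword(word, n_args=0):
    """Register a selection keyword; adding a new one is a single function."""
    def register(func):
        KEYWORDS[word] = (n_args, func)
        return func
    return register


@keyword("oxygen")
def _oxygen():
    return {i for i, _, _, element in ATOMS if element == "O"}


@keyword("resname", n_args=1)
def _resname(name):
    return {i for i, _, resname, _ in ATOMS if resname == name}


def query(text):
    """Evaluate 'a and b or c' style queries as set intersections and unions."""
    tokens = text.split()
    pos = 0

    def atom_set():
        # Look up one keyword, consume its arguments, return its index set.
        nonlocal pos
        n_args, func = KEYWORDS[tokens[pos]]
        args = tokens[pos + 1:pos + 1 + n_args]
        pos += 1 + n_args
        return func(*args)

    def conjunction():
        # Chain of 'and' terms: intersect the sets.
        nonlocal pos
        result = atom_set()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            result &= atom_set()
        return result

    # Chain of 'or' terms: unite the conjunctions.
    result = conjunction()
    while pos < len(tokens) and tokens[pos] == "or":
        pos += 1
        result |= conjunction()
    return result


# Extending the language means registering one more keyword.
@keyword("sulfur")
def _sulfur():
    return {i for i, _, _, element in ATOMS if element == "S"}


print(query("oxygen and resname NAG"))
```

The registry is what keeps extensions cheap in this sketch: a new query term is one decorated function, and the evaluator itself does not change.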

> This also reminds me of the SQL language for database selections, and
> how classical SQL code with Python just used SQL statements within
> Python strings. Have you ever used SQLAlchemy, and looked at how
> they handle SQL statements like filters, ands, ors, etc with a clever
> object based interface? Perhaps something like that could work for
> a 3D structure query API.

That certainly sounds very interesting. It would also allow the actual
PDB files to be incorporated into the database,
which would reduce loading and tree-building times. Surveys and pattern
screening could be done very fast. One could also imagine
connecting other databases, such as SCOP or Pfam, or web services,
e.g. PISCES.
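
As a rough idea of what an object-based selection interface in the spirit of SQLAlchemy's expression operators might look like, here is a minimal sketch (Python 3). The Selection class and the resname() and element() helpers are hypothetical names invented for this example; neither p3d nor Bio.PDB provides them.

```python
class Selection:
    """A composable predicate over atoms (hypothetical API for illustration)."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __and__(self, other):
        return Selection(lambda atom: self.predicate(atom) and other.predicate(atom))

    def __or__(self, other):
        return Selection(lambda atom: self.predicate(atom) or other.predicate(atom))

    def __invert__(self):
        return Selection(lambda atom: not self.predicate(atom))

    def apply(self, atoms):
        # Keep only the atoms for which the combined predicate holds.
        return [atom for atom in atoms if self.predicate(atom)]


def resname(name):
    return Selection(lambda atom: atom["resname"] == name)


def element(symbol):
    return Selection(lambda atom: atom["element"] == symbol)


# Toy atoms as plain dictionaries.
atoms = [
    {"name": "CA", "resname": "ALA", "element": "C"},
    {"name": "O", "resname": "ALA", "element": "O"},
    {"name": "O1", "resname": "NAG", "element": "O"},
]

not_nag_oxygens = element("O") & ~resname("NAG")
print([atom["name"] for atom in not_nag_oxygens.apply(atoms)])
```

Compared with a query string, the conditions here are ordinary Python objects combined with &, | and ~, so mistakes surface as Python errors rather than parse errors.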

Regards,

Christian