[Biopython] Biopython & p3d

Christian Fufezan fufezan at uni-muenster.de
Wed Oct 21 18:22:48 UTC 2009


>> A data structure that is built like that of Biopython.pdb imposes
>> multiple nested loops and condition queries.
>
> Not really - see below.

If things get more complicated, there might be a need ....

>> p3d's data structure is not nested and gains strength through a
>> combination of sets and a BSPTree.
>> This allows faster and more flexible looping. Looping over all alpha
>> and beta-carbons, for example, and printing x-coordinates
>>
>> p3d:
>> for atom in pdb.query('protein and atom type CB or atom type CA'):
>>     print(atom.x)
>
> The Bio.PDB structure, model or chain object do offer direct access
> to a "flat" list of atoms via the get_atoms() method. e.g.
>
> from Bio import PDB
> structure = PDB.PDBParser().get_structure("Test", "XXXX.pdb")
> for atom in structure.get_atoms():
>     if atom.name in ["CA", "CB"]:
>         print(atom.coord)
>
> (I'd have to think a bit longer about how in general to restrict this
> to proteins, here that is implicit since CA and CB are protein specific)
>

That would be a second condition to check. If the search should be
limited to certain atoms of chains A and C, one would need yet
another check. Personally, I cannot see the advantages of a nested
structure, but then I am not an expert.

> You can also of course use a list comprehension, e.g. to get all
> the x-coordinates (which I guess is what your example does),
>
> from Bio import PDB
> structure = PDB.PDBParser().get_structure("Test", "XXXX.pdb")
> x_list = [atom.coord[0] for atom in structure.get_atoms()
>           if atom.name in ["CA", "CB"]]
>
> You can also drill down through the nested structure of models,
> chains and residues to get to the atoms that way.
>
> To me these are more Pythonic than the clever natural language
> parsing in p3d (which seems ideal for a user interface, rather than
> a programming API).

That is, I guess, a matter of taste. I am happy if an API helps me to
reach my goal fast. To me,
x_list = [atom.x for atom in pdb.query('protein and atom type CB or atom type CA')]
seems more intuitive and clearer than
x_list = [atom.coord[0] for atom in structure.get_atoms() if atom.name in ["CA", "CB"]]
Pythonic, for me, means readable source code.

If things get more complex, then the power of a human-readable
interface becomes clearer.
For example, suppose you want to get all ALAs that are within a
distance range of a point in space.
In p3d, one can define the point in space as a p3d.vector.Vector, let's
say V1, and then form a query
similar to "within 20 of V1 and not within 10 of V1".
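
To make clear what such a shell query has to compute, here is a minimal sketch in plain Python, without p3d. The Atom record, the in_shell helper and the boundary handling (inner < d <= outer) are my assumptions for illustration only; they are not p3d's implementation, which would use the BSPTree rather than a linear scan.

```python
import math
from collections import namedtuple

# Hypothetical flat atom record; not p3d's or Biopython's atom class.
Atom = namedtuple("Atom", "name resname x y z")


def in_shell(atoms, centre, inner, outer):
    """Return atoms whose distance d to centre satisfies inner < d <= outer."""
    selected = []
    for atom in atoms:
        d = math.dist((atom.x, atom.y, atom.z), centre)
        if inner < d <= outer:
            selected.append(atom)
    return selected


atoms = [
    Atom("CA", "ALA", 5.0, 0.0, 0.0),   # 5 A from the origin: too close
    Atom("CB", "ALA", 0.0, 15.0, 0.0),  # 15 A: inside the shell
    Atom("CA", "GLY", 0.0, 0.0, 12.0),  # 12 A, but not an alanine
    Atom("CA", "ALA", 30.0, 0.0, 0.0),  # 30 A: too far
]
v1 = (0.0, 0.0, 0.0)

# "ALA and within 20 of V1 and not within 10 of V1", spelled out by hand
shell_alas = [a for a in in_shell(atoms, v1, 10.0, 20.0) if a.resname == "ALA"]
```

Every extra condition becomes another explicit test in the loop or the comprehension, which is exactly the bookkeeping a query string hides.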

Or take all protein oxygens that are not part of the backbone and are
within 4 Å of a ligand, e.g. ATP.
Without knowing what kind of oxygens these could be (e.g. OG1, OG,
OE1, OD1, OD2, OE2),
one can easily formulate a query in the form of "protein and oxygen
and not backbone and within 4 of resname ATP".

The query can actually also be resolved into a series of set operations, e.g.
for atom in pdb.hash["resid"][20] & pdb.hash["oxygen"][""]:
but the query function is simply too convenient ;)
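
The idea behind such set operations can be sketched in a few lines of plain Python. The Atom record, the build_index helper and the index keys below are hypothetical and much simpler than p3d's actual hash; the sketch only shows how a dict of sets turns a selection into an intersection.

```python
from collections import defaultdict, namedtuple

# Hypothetical atom record; field names are assumptions for this sketch.
Atom = namedtuple("Atom", "idx name element resid")


def build_index(atoms):
    """Group atoms into sets keyed by residue id and by element."""
    index = {"resid": defaultdict(set), "element": defaultdict(set)}
    for atom in atoms:
        index["resid"][atom.resid].add(atom)
        index["element"][atom.element].add(atom)
    return index


atoms = [
    Atom(1, "N", "N", 20),
    Atom(2, "O", "O", 20),
    Atom(3, "OG", "O", 20),
    Atom(4, "O", "O", 21),
]
index = build_index(atoms)

# "resid 20 and oxygen" as a plain set intersection
hits = index["resid"][20] & index["element"]["O"]
```

Because each lookup returns a set, "and", "or" and "not" map onto &, | and - without any nested loops.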

> Biopython might be improved by defining an
> atoms property (list or iterator?) instead of the get_atoms() method.
>
Agree. I would argue that p3d's atom/vector class seems the way to go.

> One might also ask for x, y and z properties on the atom object
> to provide direct access to the three coordinates as floats. Do
> you think this sort of little thing would help improve Bio.PDB?
>
Yes indeed, that is _the_ information a pdb module should offer
without any extra work.
Even better would be if the atoms could be treated as vectors (see below).
p3d has a series of convenient atom object attributes.
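
As a rough sketch of what such x, y and z properties could look like on top of a coordinate sequence: the FlatAtom class below is hypothetical and is neither Bio.PDB's Atom nor a p3d class, it only illustrates read-only properties over a stored coord.

```python
class FlatAtom:
    """Hypothetical atom wrapper; not Bio.PDB.Atom or a p3d class."""

    def __init__(self, name, coord):
        self.name = name
        self.coord = list(coord)  # stored as [x, y, z]

    @property
    def x(self):
        return self.coord[0]

    @property
    def y(self):
        return self.coord[1]

    @property
    def z(self):
        return self.coord[2]


atom = FlatAtom("CA", (1.0, 2.5, -3.0))
xyz = (atom.x, atom.y, atom.z)
```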

>> Still I think both methods could exist side by side. If it is
>> efficient - I don't know. Replacing Biopython's pdb parser was never
>> the intention and I think it has features that are really good and fast!
>
> Yes, it should be possible to offer nice nested access and nice flat
> access from the same objects. Internally the current Biopython PDB
> structure could perhaps be handled as filtered views of a complete
> list of all the atoms (using sets and trees or a database or whatever).
> That might make some things faster too.

I agree to some extent. As above, I can only say that I cannot see the  
advantage of a nested data structure.
Maybe you can explain with an example where drilling through the  
nested structure could come in handy.

>> Yes that was one thing that we were really missing. Also the fact that
>> biopython requires the unfolded entity to be converted to vectors and so
>> forth was a bit complex and we needed fast and direct access to the
>> coordinates, the very essence of pdb files.
>
> I'm not quite sure what you mean here by "vectors". Could you
> be a little more specific? Do you want NumPy style objects or
> something else?


In p3d the atom objects are vectors, so writing a structural
alignment script is straightforward (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP).
Finding the geometric centre of the protein, a residue, a chain
or a custom set is simply
centre = p3d.vector.Vector()
for atom in atoms:
	centre += atom
centre = centre/len(atoms)

So the distance between two atoms is the length of their difference, e.g.
atomA.distanceTo(atomB) will yield the same as abs(atomA-atomB).

Yes, similar to a NumPy object, but without the big NumPy overhead and
more specific to atoms, e.g. atom.resid, atom.chain, atom.beta, atom.x.
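
To illustrate the atoms-as-vectors idea without p3d installed, here is a minimal sketch. Vec3, Atom and distance_to are hypothetical stand-ins that I made up for this example; they are not p3d.vector.Vector or p3d's atom class, and they only implement the operators used above.

```python
import math


class Vec3:
    """Hypothetical minimal 3D vector; not p3d.vector.Vector itself."""

    def __init__(self, x=0.0, y=0.0, z=0.0):
        self.x, self.y, self.z = float(x), float(y), float(z)

    def __add__(self, other):
        return Vec3(self.x + other.x, self.y + other.y, self.z + other.z)

    def __sub__(self, other):
        return Vec3(self.x - other.x, self.y - other.y, self.z - other.z)

    def __truediv__(self, scalar):
        return Vec3(self.x / scalar, self.y / scalar, self.z / scalar)

    def __abs__(self):
        # Length of the vector
        return math.sqrt(self.x ** 2 + self.y ** 2 + self.z ** 2)

    def distance_to(self, other):
        return abs(self - other)


class Atom(Vec3):
    """An atom is a vector that also carries annotation."""

    def __init__(self, name, resid, chain, x, y, z):
        super().__init__(x, y, z)
        self.name, self.resid, self.chain = name, resid, chain


atoms = [Atom("CA", 1, "A", 0, 0, 0), Atom("CB", 1, "A", 3, 4, 0)]

# Geometric centre: sum the atoms as vectors, then divide by their number
centre = Vec3()
for atom in atoms:
    centre += atom
centre = centre / len(atoms)

distance = atoms[0].distance_to(atoms[1])
```

Since the atom inherits from the vector, the annotation (name, resid, chain) travels with the coordinates and no conversion step is needed before doing arithmetic.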




