[Biopython] Biopython & p3d

Peter biopython at maubp.freeserve.co.uk
Wed Oct 21 18:14:10 EDT 2009


On Wed, Oct 21, 2009 at 7:22 PM, Christian Fufezan wrote:
>> Biopython might be improved by defining an atom
>> property (list or iterator?) instead of the get_atoms() method.
>
> agree.  I would argue that p3d's atom/vector class seems the way to go.

We can probably have similar things for chains etc. Any other
views on this? I never liked the get_* and set_* methods in
Bio.PDB myself, and using Python properties seems more
natural here (they may not have existed when Bio.PDB was
first started - I'd have to check).
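
To make the idea concrete, here is a minimal sketch of a property
sitting alongside an old-style getter. The Residue class and the
"atoms" attribute name are hypothetical stand-ins for illustration,
not the actual Bio.PDB code:

```python
class Residue:
    """Hypothetical stand-in for a Bio.PDB-style residue (not the real class)."""

    def __init__(self, name, atoms):
        self.name = name
        self._atoms = list(atoms)

    def get_atoms(self):
        # Old-style getter, kept so existing scripts continue to work.
        return iter(self._atoms)

    @property
    def atoms(self):
        # Read-only property returning a fresh list, so callers cannot
        # alter the residue's internal list by accident.
        return list(self._atoms)


residue = Residue("GLY", ["N", "CA", "C", "O"])
print(residue.atoms)              # property access, no call needed
print(list(residue.get_atoms()))  # the getter still works
```

Keeping both would let old scripts run unchanged while new code
uses the property.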

[We should probably break out specific suggestions like this
into new mailing list threads, and CC people like Thomas H.]

>> One might also ask for x, y and z properties on the atom object
>> to provide direct access to the three coordinates as floats. Do
>> you think this sort of little thing would help improve Bio.PDB?
>>
> yes indeed, that is _the_ information a pdb module should offer
> without any addition. Better would be even if the atoms are
> treatable as vectors (see below). p3d has a series of atom
> object attributes that are convenient.

I would argue that the x-y-z triple (which Biopython has) is
more important than separate x, y, and z floats. We seem
to agree here.

The Biopython atom's coord property is an x-y-z triple (as a
one dimensional numpy array). The Bio.PDB code also
defines its own vector objects on top of this, but my memory
of the details is hazy here. As I recall, I personally stuck
with the numpy objects in my scripts using Bio.PDB.
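
As a rough sketch of how x, y and z properties could be layered on
top of the existing triple: the Atom class below is hypothetical,
and a plain tuple stands in for the numpy array so the example
has no dependencies.

```python
class Atom:
    """Hypothetical atom; a plain tuple stands in for the numpy coord array."""

    def __init__(self, name, coord):
        self.name = name
        self.coord = tuple(float(value) for value in coord)

    # Each property is just a view onto one element of the triple,
    # so the triple stays the single source of truth.
    @property
    def x(self):
        return self.coord[0]

    @property
    def y(self):
        return self.coord[1]

    @property
    def z(self):
        return self.coord[2]


atom = Atom("CA", (1.5, -2.0, 3.25))
x, y, z = atom.coord           # unpack the triple in one go
print(atom.x, atom.y, atom.z)  # or read the coordinates one at a time
```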

>> Yes, it should be possible to offer nice nested access and nice flat
>> access from the same objects. Internally the current Biopython PDB
>> structure could perhaps be handled as filtered views of a complete
>> list of all the atoms (using sets and trees or a database or whatever).
>> That might make some things faster too.
>
> I agree to some extent. As above, I can only say that I
> cannot see the advantage of a nested data structure.
> Maybe you can explain with an example where drilling
> through the nested structure could come in handy.

The drill down is great for selecting a particular residue or
chain (or for NMR, a particular model). It is also good for
looping over these structures - e.g. to process psi/phi
angles along a protein backbone.
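
A toy sketch of the "filtered views of a complete list of all the
atoms" idea quoted above, showing drill-down and flat access from
the same data. The record fields and the select/residues helpers
are assumptions made up for this example; none of it is Bio.PDB code.

```python
from collections import namedtuple

# Hypothetical flat record: one row per atom.
AtomRecord = namedtuple("AtomRecord", "model chain resseq name")

atoms = [
    AtomRecord(0, "A", 1, "N"),
    AtomRecord(0, "A", 1, "CA"),
    AtomRecord(0, "A", 2, "N"),
    AtomRecord(0, "A", 2, "CA"),
    AtomRecord(0, "B", 1, "CA"),
]


def select(records, **criteria):
    """Return the records whose fields match every keyword given."""
    return [
        record
        for record in records
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]


def residues(records):
    """Group a flat atom list by (model, chain, resseq), preserving order."""
    grouped = {}
    for record in records:
        key = (record.model, record.chain, record.resseq)
        grouped.setdefault(key, []).append(record)
    return grouped


chain_a = select(atoms, model=0, chain="A")      # drill down to one chain
for key, members in residues(chain_a).items():   # loop residue by residue
    print(key, [record.name for record in members])

all_ca = select(atoms, name="CA")                # flat access across chains
print(len(all_ca))
```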

>>> Yes that was one thing that we were really missing. Also the fact that
>>> biopython requires the unfolded entity to be converted to vectors and so
>>> forth was a bit complex and we needed fast and direct access to the
>>> coordinates, the very essence of pdb files.
>>
>> I'm not quite sure what you mean here by "vectors". Could you
>> be a little more specific? Do you want NumPy style objects or
>> something else?
>
> In p3d the atom objects are vectors,

I don't immediately see what the intention is here. What does
"adding" or "subtracting" two atom/vector objects give you? A
new non-atom vector would be my guess? What about
multiplying by a scalar? Again, getting a non-atom vector
object back makes most sense.
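
To spell out the behaviour I am guessing at, here is a minimal
sketch in which arithmetic on atoms returns plain vectors. The
Vector and Atom classes are hypothetical and written only for
this example; this is not how p3d is actually implemented.

```python
import math


class Vector:
    """Hypothetical 3D vector (not the p3d or Bio.PDB implementation)."""

    def __init__(self, x, y, z):
        self.x, self.y, self.z = float(x), float(y), float(z)

    def __add__(self, other):
        return Vector(self.x + other.x, self.y + other.y, self.z + other.z)

    def __sub__(self, other):
        return Vector(self.x - other.x, self.y - other.y, self.z - other.z)

    def __mul__(self, scalar):
        return Vector(self.x * scalar, self.y * scalar, self.z * scalar)

    def __abs__(self):
        # Length of the vector.
        return math.sqrt(self.x ** 2 + self.y ** 2 + self.z ** 2)


class Atom(Vector):
    """An atom that is also a vector, carrying extra annotation."""

    def __init__(self, name, x, y, z):
        super().__init__(x, y, z)
        self.name = name


atom_a = Atom("CA", 0, 0, 0)
atom_b = Atom("CB", 3, 4, 0)

# The operators build Vector objects explicitly, so the results carry
# no atom name: a sum or difference of two atoms is not itself an atom.
print(type(atom_a + atom_b).__name__)
print(type(atom_b * 2).__name__)
print(abs(atom_b - atom_a))
```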

> so writing a structural alignment script is straightforward
> (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP).

Structural alignment is not so different in Biopython - just the details. e.g.
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

> Or to find the geometric centre of the protein/a residue/ a chain
> or a custom set is simply
> centre = p3d.vector.Vector()
> for atom in atoms:
>        centre += atom
> centre = centre/len(atoms)

And you can do all of that with the NumPy array of three coordinates
accessed via atom.coord - in many respects it is a "vector". For
example, with a typical Bio.PDB Residue object, the geometric
center/centre is just one line:

>>> centre = sum(atom.coord for atom in residue) / len(residue)
>>> centre
array([ -0.21274999,   2.609375  ,  13.95149994], dtype=float32)

The centre of mass would be more interesting to calculate,
but for that we need the atomic masses.
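
A minimal sketch of the mass-weighted version, assuming the
coordinates and element symbols have already been pulled out of
the atoms as two parallel lists (how to get the element for each
atom is not shown here). The centre_of_mass helper and the small
mass table are made up for this example.

```python
import numpy as np

# Standard atomic masses in daltons, rounded; only a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def centre_of_mass(coords, elements):
    """Mass-weighted mean position of a set of atoms.

    coords   -- sequence of (x, y, z) triples
    elements -- element symbols, one per coordinate triple
    """
    coords = np.asarray(coords, dtype=float)
    masses = np.array([ATOMIC_MASS[element] for element in elements])
    # Weighted average over the atom axis; equal weights would give
    # the plain geometric centre instead.
    return np.average(coords, axis=0, weights=masses)


coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
elements = ["N", "C", "O"]

print(np.mean(coords, axis=0))           # geometric centre
print(centre_of_mass(coords, elements))  # pulled towards the heavier atoms
```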

> So distances between two atoms are the length of their subtraction, e.g
> atomA.distanceTo(atomB) will yield the same as abs(atomA-atomB)

I guess your atomA-atomB returns a vector, and abs() gives
its length.

You can get the distance between two Bio.PDB atoms with
atomA-atomB (and you don't need to stick an abs on it either,
because our atoms are not trying to act like vectors - we
can just return a float).
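
A sketch of that design choice, using a hypothetical Atom class
rather than the real Bio.PDB one, with a plain tuple standing in
for the numpy coordinate array:

```python
import math


class Atom:
    """Hypothetical atom, not the real Bio.PDB class."""

    def __init__(self, name, coord):
        self.name = name
        self.coord = tuple(float(value) for value in coord)

    def __sub__(self, other):
        # Subtraction is defined as the Euclidean distance between the
        # two atoms, so the result is a float and not a vector.
        return math.dist(self.coord, other.coord)


atom_a = Atom("CA", (0.0, 0.0, 0.0))
atom_b = Atom("CB", (1.0, 2.0, 2.0))

distance = atom_a - atom_b
print(distance)
```

The trade-off is that the minus sign no longer means vector
subtraction; for that you would subtract the coordinate arrays
themselves.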

> Yes similar to a NumPy object, but without the big NumPy overhead
> and more specific to atoms, e.g. atom.resid, atom.chain, atom.beta,
> atom.x.

Well, yes, NumPy is a big project, and Bio.PDB is one of the main
bits of Biopython that uses it. But it is very useful for numerical
work, and a good choice here I think. And assuming you *like*
numpy, having the Bio.PDB atom objects expose the x-y-z
coordinates as a simple one dimensional numpy array of floats
is very natural.

You said earlier:
>>> Also the fact that biopython requires the unfolded entity
>>> to be converted to vectors and so forth was a bit complex
>>> and we needed fast and direct access to the coordinates,
>>> the very essence of pdb files.

I disagree. The Biopython atom objects give "fast and direct
access to the coordinates" via the coord property, which is
a one-dimensional numpy array (aka, a vector). For fast
and efficient numerical operations there is no need to
convert this into anything else (although a bespoke vector
object may make things more elegant).

Peter

P.S. This thread is proving quite interesting :)


