[Biopython] (Bio.PDB) problems with NeighborSearch: error at levels above "A", residue index discrepancy with unfold_entities

Thu Oct 17 23:25:42 UTC 2013

Hey James,

NeighborSearch. Unless you edit your structure, you should always get back
the same atoms from whatever method you use to access that structure
(unfold_entities, NeighborSearch, iteration, etc). If you want paste your
code here and I can try to explain what is going on, maybe it makes things
a bit more clear. Also, if you can reproduce the weird behavior, paste the
code on pastebin.com or write it here in the thread so we can try it on our
machines too. The behavior of the get_id() is not inconvenient at all, at
least from a biological point of view. You usually want the residue
position in the amino acid sequence, not the computer science data
structure index. You should always work with the indices from the PDB file,
otherwise biologists will get quite mad at you if you start using the other
numbers :)

Chain breaks. Chain breaks are literally a break (discontinuity) in the
polypeptide chain. Sometimes you cannot get enough density from the x-ray
experiment to accurately determine the position of particular atoms. You
usually see this at lower resolution structures (3Å) or very mobile regions
(loops). What happens is that you therefore get a gap in the structure,
say, from residue 14 to residue 20 there is nothing there. But in the
sequence that went in the x-ray beam, these 5 residues (15-19) were there,
so you get the numbering taking them into account as well. As for
implications.. well, depends on what you are doing with the structure.
Calculating distances, iterating over residues, etc, will not be
problematic at all. You will just 'miss' some residues because they are
just not there. You might want to pay particular attention if you are
renumbering your structure to make sure it 'respects' these gaps for
example.

2h62 has 4 different chains and they are indeed complete. I get the same
warning but for all chains, and the lines I get notified about are the
first solvent molecules of that particular chain. The way StructureBuilder
works is a bit silly indeed: it iterates over the lines of the PDB file and
when it finds a different chain identifier from the one it was reading in
the line before it adds a new chain. If this chain already exists, it
raises this warning. It's a bit silly in this case because HETATM should
not be accounted for in this situation since they always come at the end of
the file.. If you can, submit a bug report or feature report in our tracker
and I'll go over it when I have some free time.

Cheers,

João

2013/10/18 James Jensen <jdjensen at eng.ucsd.edu>

>  Hi, João,
>
> A late reply is much better than no reply. I'm impressed you tracked this
> down and followed up, and I appreciate your help. And it took me a while to
> get around to revisiting this myself.
>
> I was using Python 3.2 when I got the "unorderable types" error. For
> unrelated reasons, I ended up switching to Python 2.7.3, and now doing
> search_all at the residue level works.
>
> The issue with the indexing is not that the residues' get_id() function
> returns a different number from the corresponding list index in the list
> returned by Selection.unfold_entities(). That is inconvenient, but I've
> been working around it. What puzzled me is that it appeared that the
> NeighborSearch was accessing residues that unfold_entities() wasn't
> accessing, although this wouldn't make sense because I used NeighborSearch
> on the results of a call to unfold_entities(). Let me check again; it could
> have been something I was doing wrong.
>
> What do the chain breaks mean? Are they missing data, and if so, what is
> missing? And what are their consequences for working with the data? How
> would they be problematic for iterating over residues, calculating
> distances, returning the amino acid sequence of the structure, etc?
>
> Thanks again,
>
> James
>
>
>
> On 10/08/2013 02:22 PM, João Rodrigues wrote:
>
> Dear James,
>
>  Regarding problem 1. What you describe runs fine on my machine, using
> Python 2.7.5 and an up-to-date Biopython git version. You logic seems fine,
> maybe it's the version of Python you are using?
>
>  Regarding your second problem, that of the mismatched indexes. The
> Selection method returns a *list* of residues while when you iterate over
> the neighbors and ask for their id it gives back the id of the residue.
> This id will only correspond to the Selection list index if your residues
> are numbered from 1 to N without gaps. If your protein starts at residue 3,
> then the first item given back by Selection has index 0 while in fact the
> id is 3. Does this make sense?
>
>  The warning occurs if you have chain breaks. There should be some gaps
> in your structure, starting at a number other than 1 does not raise this
> warning normally.
>
>  Cheers and sorry for the late reply,
>
>  João
>
>
>
> 2013/8/30 James Jensen <jdjensen at eng.ucsd.edu>
>
>> Hello!
>>
>> I am writing a function that, given two chains in a PDB file, should
>> return 1) the positions and identities of all residues that are in contact
>> with (distance < 5 angstroms) a residue on the other chain, and 2) the
>> amino acid sequences of the chains. I've been doing this with
>> NeighborSearch.search_all(radius=5, level='A') and then for each atom pair,
>> seeing what its parent residue is and whether the parent residues of the
>> two atoms belong to different chains. This may seem like a roundabout way
>> of doing it, but if I call search_all(radius=5, level='R'), or indeed with
>> level=any level other than 'A', I get the error
>>
>>         TypeError: unorderable types: Residue() < Residue()
>>
>> So my first question is why it might be that search_all isn't working at
>> higher levels.
>>
>> For the adjacent residue pairs I identify using NeighborSearch, I get
>> each residue's position in its respective chain by residue.get_id()[1].
>>
>> I've noticed, however, that if I get the sequence of the chain using seq
>> = Selection.unfold_entities(chain, 'R') and then reference (i.e.
>> seq[index]) the amino acids using the indices returned by the
>> NeighborSearch step, they are not the same residues that I get if during
>> the NeighborSearch step I report residue.get_resname() for each adjacent
>> residue.
>>
>> I've tried it with several proteins, and the problem is the same. Chains
>> A and C of 2h62 are an example.
>>
>> I then noticed that the lowest residue ID number of the residues yielded
>> from Selection.unfold_entities(chain, 'R') is not 1. For chain A, it's 11,
>> and for chain C, it's 34. Not knowing why this was, I thought I'd try
>> subtracting the lowest ID number from the indices returned by the
>> NeighborSearch step (i.e. in chain A, 11 -> 0 so seq[0] would be the first
>> residue, the one with ID 11). This happened to seem to work for chain A.
>> However, it gives me negative indices for some of the contacts in chain C.
>> This means that NeighborSearch can return residues that are not returned by
>> unfold_entities(). The lowest residue ID returned by NeighborSearch for
>> chain C was 24, whereas for unfold_entities() it was 34.
>>
>> For both chains A and C, I was given the warning
>>
>>         PDBConstructionWarning: WARNING: Chain [letter] is discontinuous
>> at line [line number].
>>
>> In fact, I seem to get this warning for just about every chain of every
>> structure I load. Is this the reason that the first residues in the two
>> chains are at 11 and 34, rather than 1? If so, could it be that
>> NeighborSearch is able to work around the discontinuity while
>> unfold_entities is not?
>>
>> Any suggestions?
>>
>> Thanks for your time and help,
>>
>> James Jensen
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>
>