[Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
Hongbo Zhu 朱宏博
macrozhu at gmail.com
Mon Dec 5 14:48:25 UTC 2011
On Mon, Dec 5, 2011 at 2:50 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:
> On Mon, Dec 5, 2011 at 1:38 PM, Hongbo Zhu 朱宏博 <macrozhu at gmail.com> wrote:
> >
> >> Perhaps I misunderstood - I would not want to allow the syntax
> >> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only
> allow
> >> the user to use mychain[2:41] which requires Python counting.
> >
> > But even in mychain[2:41], the 2 and 41 should be residue sequence
> number.
> > Then it is consistent with the current acceptable syntax mychain[2],
> where 2
> > also refers to a sequence number. At the moment, BioPython also
> > accepts mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ',
> 40,
> > ' ')] would be just a nature extension of mychain[(' ', 2, ' ')].
> >
> > According to the source code, mychain[2] is considered an abbreviation of
> > mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to
> > mychain[(' ', 2, ' ')] by function Bio.PDB.Chain.__translate_id(). So if
> > mychain[2:4] would be allowed, internally it would also
> > be first translated to mychain[(' ', 2, ' '): (' ', 40, ' ')]. So in my
> > point of view, mychain[2:4] is just an abbreviation for mychain[(' ', 2,
> '
> > '): (' ', 40, ' ')], just like mychain[2] is a short version of
> mychain[('
> > ',2,' ')].
> >
> > hongbo
>
> I've never really liked these strange tuple IDs, which are usually
> but not always full of empty values. I understand some of
> the corner cases they handle, but they are very complicated.
>
This seems to be the problem of PDB. I don't know how other packages handle
the issue.
In addition, I once proposed to remove the HETERO-flag in the residue ID.
http://biopython.org/pipermail/biopython-dev/2011-January/008640.html
It is only retained for the backwards compatibility with PDB files before
remediation in 2007. Removing only HETERO-flag does not solve
the problem totally, but to some extent (say, around 50%).
I think, one possible solution is to treat residue ID always as a string
'%4d%s' % (resnum, icode) instead of a tuple composed of resnum plus icode
(if we do not consider HETERO-flag). This way, biotpyhon also serves to
remind users that icode is indispensable for uniquely locating a residue
even if icode is empty. But this would lead to formidable API change.
> You cannot assume 2 will map to (' ', 2, ' ') in general - this
> is what the _translate_id method handles. Consider the case
> where you have sliced the Chain as discussed, since the
> first two elements have been removed, that mapping will shift.
>
I am afraid I don't concur. As a matter of fact, the mapping are internally
fine-tuned by comparing the input residue ID to a list of all residue IDs
so the correct index values are obtained for the slicing:
res_id_list = [r.id for r in self.get_iterator()]
start_index = res_id_list.index(self._translate_id(start))
stop_index = res_id_list.index(self._translate_id(stop))
return self.get_list()[start_index:stop_index:step]
It is like mychain[2], this 2 will be first translated to (' ',2,' ') and
this residue ID is searched in the dictionary indexed by all residues IDs
to locate the correct residue.
> We definitely would need a test case covering non-trivial
> ID tuples (e.g. using insertion codes), and tests slicing a
> previously sliced Chain.
>
PDB entry 1h4w is a good example with icode and the sequence of chain A
starts with resnum 16.
def get_slice(self, start, end, step=None):
"""Return a slice of the chain from start to end (including end
position)
Arguments:
o start - (string, int, string) or int
o end - (string, int, string) or int
o step - None or int
"""
res_id_list = [r.id for r in self.get_iterator()]
start_index = res_id_list.index(self._translate_id(start))
end_index = res_id_list.index(self._translate_id(end))
return self.get_list()[start_index:end_index+1:step]
In [66]: mychain.get_slice(182,189)
Out[66]:
[<Residue CYS het= resseq=182 icode= >,
<Residue VAL het= resseq=183 icode= >,
<Residue GLY het= resseq=184 icode= >,
<Residue PHE het= resseq=184 icode=A>,
<Residue LEU het= resseq=185 icode= >,
<Residue GLU het= resseq=186 icode= >,
<Residue GLY het= resseq=187 icode= >,
<Residue GLY het= resseq=188 icode= >,
<Residue LYS het= resseq=188 icode=A>,
<Residue ASP het= resseq=189 icode= >]
> Peter
>
--
Hongbo
More information about the Biopython-dev
mailing list