[Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?

Hongbo Zhu 朱宏博 macrozhu at gmail.com
Mon Dec 5 14:48:25 UTC 2011


On Mon, Dec 5, 2011 at 2:50 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Mon, Dec 5, 2011 at 1:38 PM, Hongbo Zhu 朱宏博 <macrozhu at gmail.com> wrote:
> >
> >> Perhaps I misunderstood - I would not want to allow the syntax
> >> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only
> allow
> >> the user to use mychain[2:41] which requires Python counting.
> >
> > But even in mychain[2:41], the 2 and 41 should be residue sequence
> number.
> > Then it is consistent with the current acceptable syntax mychain[2],
> where 2
> > also refers to a sequence number. At the moment, BioPython also
> > accepts mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ',
> 40,
> > ' ')] would be just a nature extension of mychain[(' ', 2, ' ')].
> >
> > According to the source code, mychain[2] is considered an abbreviation of
> > mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to
> > mychain[(' ', 2, ' ')] by function Bio.PDB.Chain.__translate_id(). So if
> > mychain[2:4] would be allowed, internally it would also
> > be first translated to mychain[(' ', 2, ' '): (' ', 40, ' ')]. So in my
> > point of view, mychain[2:4] is just an abbreviation for mychain[(' ', 2,
> '
> > '): (' ', 40, ' ')], just like mychain[2] is a short version of
> mychain[('
> > ',2,' ')].
> >
> > hongbo
>
> I've never really liked these strange tuple IDs, which are usually
> but not always full of empty values. I understand some of
> the corner cases they handle, but they are very complicated.
>

This seems to be the problem of PDB. I don't know how other packages handle
the issue.
In addition, I once proposed to remove the HETERO-flag in the residue ID.
http://biopython.org/pipermail/biopython-dev/2011-January/008640.html
It is only retained for the backwards compatibility with PDB files before
remediation in 2007. Removing only HETERO-flag does not solve
the problem totally, but to some extent (say, around 50%).

I think, one possible solution is to treat residue ID always as a string
'%4d%s' % (resnum, icode) instead of a tuple composed of resnum plus icode
(if we do not consider HETERO-flag). This way, biotpyhon also serves to
remind users that icode is indispensable for uniquely locating a residue
even if icode is empty. But this would lead to formidable API change.



> You cannot assume 2 will map to (' ', 2, ' ') in general - this
> is what the _translate_id method handles. Consider the case
> where you have sliced the Chain as discussed, since the
> first two elements have been removed, that mapping will shift.
>

I am afraid I don't concur. As a matter of fact, the mapping are internally
fine-tuned by comparing the input residue ID to a list of all residue IDs
so the correct index values are obtained for the slicing:

        res_id_list = [r.id for r in self.get_iterator()]
        start_index = res_id_list.index(self._translate_id(start))
        stop_index  = res_id_list.index(self._translate_id(stop))
        return self.get_list()[start_index:stop_index:step]

It is like mychain[2], this 2 will be first translated to (' ',2,' ') and
this residue ID is searched in the dictionary indexed by all residues IDs
to locate the correct residue.


> We definitely would need a test case covering non-trivial
> ID tuples (e.g. using insertion codes), and tests slicing a
> previously sliced Chain.
>

PDB entry 1h4w is a good example with icode and the sequence of chain A
starts with resnum 16.

    def get_slice(self, start, end, step=None):
        """Return a slice of the chain from start to end (including end
position)

        Arguments:
        o start - (string, int, string) or int
        o end - (string, int, string) or int
        o step - None or int
        """
        res_id_list = [r.id for r in self.get_iterator()]
        start_index = res_id_list.index(self._translate_id(start))
        end_index  = res_id_list.index(self._translate_id(end))

        return self.get_list()[start_index:end_index+1:step]

In [66]: mychain.get_slice(182,189)
Out[66]:
[<Residue CYS het=  resseq=182 icode= >,
 <Residue VAL het=  resseq=183 icode= >,
 <Residue GLY het=  resseq=184 icode= >,
 <Residue PHE het=  resseq=184 icode=A>,
 <Residue LEU het=  resseq=185 icode= >,
 <Residue GLU het=  resseq=186 icode= >,
 <Residue GLY het=  resseq=187 icode= >,
 <Residue GLY het=  resseq=188 icode= >,
 <Residue LYS het=  resseq=188 icode=A>,
 <Residue ASP het=  resseq=189 icode= >]



> Peter
>



-- 
Hongbo




More information about the Biopython-dev mailing list