Sequence numbering (was: Re: [BioPython] bioperl idl)

Andrew Dalke dalke@bioreason.com
Thu, 23 Sep 1999 16:27:39 -0600


Jeff said:
> For example, the Swiss-Prot feature table entries specify sequence
> features based on a strict ordinal numbering of the residues starting
> from one.  If our sequence indexing were 1-based and end-inclusive,
> we could do:
>
>  seq[feature.start:feature.end]
>

Since file I/O is included in what I meant by "interface", I would
have the parser convert to the Python numbering.  In that way, all
range references you get could be used in the form start:end.
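
To make that concrete, here is a rough sketch of the conversion a
parser could apply to each 1-based, end-inclusive range it reads.  It is
written in present-day Python 3 syntax, and the helper name
to_python_slice is made up for illustration; it is not part of any
existing parser.

```python
def to_python_slice(start, end):
    """Convert a 1-based, end-inclusive range to a Python slice.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end, got %r..%r" % (start, end))
    # Shift the start down by one.  The inclusive 1-based end is already
    # the exclusive end in 0-based, half-open terms, so it stays as is.
    return slice(start - 1, end)


seq = "ANDREW"
feature = to_python_slice(1, 3)          # residues 1 through 3 as listed in a file
print(seq[feature])                      # AND
print(seq[feature.start:feature.stop])   # AND, using the start:end form
```

Once the parser has done this, the rest of the code never sees the
file's numbering and can slice with start:end directly.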

Konrad:
> Here's a compromise proposal: Make indexing and slicing act the Python
> way, i.e. 0-based, and implement an additional method (e.g.
> "subsequence") that used "biological style" indices.

That got me thinking.  How about:

class Seq:
    def __init__(self, seq):
        self.seq = seq
    def __getattr__(self, key):
        # Build the "subseq" helper on demand
        if key == "subseq":
            return SubSeq(self)
        raise AttributeError, key
    def __getslice__(self, i, j):
        return self.seq[i:j]

class SubSeq:
    def __init__(self, seq):
        self.seq = seq
    def __call__(self, min, max):
        # with a base of 1 and including the end
        # Negative slice notation not allowed
        assert min>=1 and max>=min
        return self.seq[min-1:max]
    def omg(self, min, max):
        # with a base of 1 and excluding the end
        # Negative slice notation not allowed
        assert min>=1 and max>min
        return self.seq[min-1:max-1]
    def perl(self, min, max):
        # with a base of 0 and including the end
        # Negative slice notation not allowed
        assert min>=0 and max>=min
        return self.seq[min:max+1]
    def python(self, min, max):
        # plain Python slicing: base of 0 and excluding the end
        return self.seq[min:max]


>>> seq = Seq("ANDREW")
>>> print seq[1:3]
ND
>>> print seq.subseq(1,3)
AND
>>> print seq.subseq.omg(1,3)
AN
>>> print seq.subseq.perl(1,3)
NDR
>>> print seq.subseq.python(1,3)
ND

This has the obvious ability to support any indexing scheme, such as
for PDB ATOM sequencing.  It also seems pretty easy to allow
registering different translation schemes so there isn't really a need
to develop new classes all the time.  Besides, there are only a
handful of non-linear/non-integer numberings to deal with.
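
One way the registration could work, as a sketch only: keep a table of
functions that each turn a (start, end) pair in some numbering into an
ordinary Python slice.  The names SCHEMES, register_scheme and subseq
below are hypothetical, the code is in present-day Python 3 syntax, and
it covers only the integer schemes shown earlier.  A non-integer
numbering would need a lookup table rather than arithmetic, which this
does not attempt.

```python
# Maps a scheme name to a function (start, end) -> slice, where the
# returned slice is in Python's 0-based, end-exclusive convention.
SCHEMES = {}


def register_scheme(name, to_slice):
    """Register a translation from some numbering to a Python slice."""
    SCHEMES[name] = to_slice


def subseq(seq, start, end, scheme="bio"):
    """Slice seq using the numbering registered under `scheme`."""
    try:
        to_slice = SCHEMES[scheme]
    except KeyError:
        raise ValueError("unknown numbering scheme: %r" % (scheme,))
    return seq[to_slice(start, end)]


# The four conventions from the example session.
register_scheme("bio", lambda start, end: slice(start - 1, end))      # base 1, end included
register_scheme("omg", lambda start, end: slice(start - 1, end - 1))  # base 1, end excluded
register_scheme("perl", lambda start, end: slice(start, end + 1))     # base 0, end included
register_scheme("python", lambda start, end: slice(start, end))       # base 0, end excluded

print(subseq("ANDREW", 1, 3))                 # AND
print(subseq("ANDREW", 1, 3, scheme="perl"))  # NDR
```

Adding another scheme is then one register_scheme call rather than a
new class.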

The only thing I would add is that use of subseq (or whatever the new
function ends up being called) should be as infrequent as possible.
For developers that's an easy decision, since you don't want to take
the performance hit of the intermediate lookups and object creation.

						Andrew
						dalke@bioreason.com