[Biopython-dev] Circular sequences

Antony Lee antony.lee at berkeley.edu
Wed Jan 16 14:09:32 EST 2013


I think the proposed behaviour makes biological sense (now s[x:] and
s[:y] mean "cut the sequence before x (or before y) and keep the
downstream (or upstream) sequence, whatever it is").  But I understand
Peter's concerns as well.  A quick grep showed me around 400 instances
of "[:" in the current code base, and as many of ":]".  Most of them
seem to be related to string (as opposed to sequence) processing, so
checking them all may not be impossible (though not much fun, of
course), but that won't protect against future misuses of sequence
indexing.

So I think methods such as cut and roll are fine too (with slicing going
back to raising ValueError when either or both ends of the slice are
None).  It would then be the responsibility of sequence-consuming
functions to start by .cut()ting the sequence before slicing it.
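
To make the cut/roll idea concrete, here is a minimal sketch.  The class
name CircularSeq, the helper first_bases and the exact signatures are
assumptions made up for illustration; none of this is existing Biopython
API, and closed slices are deliberately left out.

```python
class CircularSeq:
    """Hypothetical circular sequence wrapper (illustration only, not a Biopython class)."""

    def __init__(self, data):
        self.data = str(data)

    def __len__(self):
        return len(self.data)

    def roll(self, origin):
        """Return a new CircularSeq whose coordinate 0 is the old `origin`."""
        if not self.data:
            return CircularSeq("")
        k = origin % len(self.data)
        return CircularSeq(self.data[k:] + self.data[:k])

    def cut(self, position=0):
        """Cut before `position` and return the full-length linear sequence (a str)."""
        return self.roll(position).data

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Open-ended slices have no well-defined "end" on a circle.
            if index.start is None or index.stop is None:
                raise ValueError("open-ended slice on a circular sequence; call cut() first")
            raise NotImplementedError("closed slices are left out of this sketch")
        # Integer indexing simply wraps around the origin.
        return self.data[index % len(self.data)]


def first_bases(seq, n):
    """Example consumer: linearize explicitly, then slice like a string."""
    linear = seq.cut() if isinstance(seq, CircularSeq) else str(seq)
    return linear[:n]
```

With this, CircularSeq("ACGTAC").cut(2) gives the plain string "GTACAC",
which any string-like code can slice safely.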

find and __contains__ can be implemented easily (though perhaps
inelegantly) by changing "foo in circular(bar)" into "foo in linear(bar)
+ linear(bar)[:len(foo)-1]" (which is essentially what is done in both
Restriction libraries, the old and the new one).
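
As a rough sketch of that trick on plain strings (the function names
circular_contains and circular_find are hypothetical, and ignoring
patterns longer than one full turn is an assumption of this example):

```python
def _extended(linear, pattern):
    """Append the first len(pattern) - 1 letters so matches may span the origin."""
    overlap = max(len(pattern) - 1, 0)
    return linear + linear[:overlap]


def circular_contains(linear, pattern):
    """True if `pattern` occurs in the circle whose linearization is `linear`."""
    if len(pattern) > len(linear):
        return False  # assumption: no matches longer than one turn
    return pattern in _extended(linear, pattern)


def circular_find(linear, pattern):
    """Lowest start index (in linear coordinates) of `pattern`, or -1."""
    if len(pattern) > len(linear):
        return -1
    return _extended(linear, pattern).find(pattern)
```

For example, "CAC" does not occur in the linear string "ACGTAC", but it
does occur across the origin of the circle, starting at index 5.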

Finally, let me say that right now I don't use most of the rest
of Biopython (and don't really think I'll use most of it in the near
future) so I care little about whether this specific feature gets
integrated or not; however I do think it is needed in a proper
restriction analysis library.  Indeed, one could say that we just have
to add a "circular=True|False" keyword argument to methods such as
search and catalyze, but that is not enough to distinguish, e.g., whether
a circular plasmid was digested once or not at all (of course, one can
check this separately, but what I mean is that circularity is a natural
"output" of these functions, not just an input).

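A small sketch of what I mean by circularity being an output.  The
function digest below is hypothetical (it is not the search or catalyze
method of the Restriction module) and works on plain strings with
0-based cut positions, each cut falling just before its position:

```python
def digest(sequence, cut_sites, circular=False):
    """Return a list of (fragment, is_circular) pairs."""
    n = len(sequence)
    if circular:
        sites = sorted({p % n for p in cut_sites})
        if not sites:
            # Uncut plasmid: the only product is still circular.
            return [(sequence, True)]
        # Rotate so that the first cut site becomes coordinate 0.
        first = sites[0]
        rolled = sequence[first:] + sequence[:first]
        bounds = [s - first for s in sites] + [n]
    else:
        # Cuts at the very ends of a linear molecule produce nothing new.
        sites = sorted({p for p in cut_sites if 0 < p < n})
        rolled = sequence
        bounds = [0] + sites + [n]
    return [(rolled[a:b], False) for a, b in zip(bounds, bounds[1:])]
```

Here digest("ACGTAC", [], circular=True) returns one circular fragment,
while digest("ACGTAC", [2], circular=True) returns one linear fragment
of the same length, "GTACAC"; a fragment list without the flag could not
tell the two cases apart.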
Antony

On Wed, Jan 16, 2013 at 10:24:13AM +0000, Peter Cock wrote:
> For those that missed it last time, I think the most recent in depth
> discussion about circular sequences and slicing was here:
> 
> http://lists.open-bio.org/pipermail/biopython/2011-March/007075.html
> ...
> http://lists.open-bio.org/pipermail/biopython/2011-March/007085.html
> 
> On Wed, Jan 16, 2013 at 9:42 AM, Markus Piotrowski
> <Markus.Piotrowski at ruhr-uni-bochum.de> wrote:
> > On 15.01.2013 22:45, Antony Lee wrote:
> >
> >> needed to y to make it bigger than x, and stop there).  Slicing with one
> >> or both ends undefined (ie. s[:], s[x:], s[:y]) raises an IndexError
> >> (because, well, I read s[x:] as "return the elements of s starting from
> >> the x'th until the end"... but there is no such end.).  (A second option
> >> would be to return an infinite iterable for s[x:], but that doesn't take
> >> care of s[:y] anyways, not to mention the bugs that may appear from
> >> that.)
> >
> >
> > Another possibility, which makes some biological sense (thinking of
> > restriction digests), would be that
> > s[x:] (or s[:y]) returns a linear sequence starting at x and ending with x-1
> > (or ending with y and starting at y+1). Thus, s[x:] would mean 'cut my
> > circle at x and return the linear sequence starting at x'.
> 
> That's exactly the kind of behaviour which would make me nervous,
> given that in general the Biopython sequence objects mimic Python strings.
> There are many examples where that 'extra' sequence would be
> unexpected. For instance, writing out line wrapped sequence data.
> 
> I would prefer an explicit method like 'cut' on a circular sequence
> object returning a full length linear sequence. Similarly a 'roll' or
> 'rotate' method could shift the origin to a new coordinate.
> 
> One simple solution to the complexities of the slice behaviour is
> the practical one: they act like Python strings, and basically all we
> would be adding would be an 'is circular' flag and some logic about
> how to propagate that flag in operations like addition and slicing.
> If we went that route it might still be possible to make the find and
> 'in' functionality origin aware... but that may just cause trouble.
> 
> This would solve the question of where to store whether a sequence is
> circular (e.g. when reading GenBank and EMBL files - or for handling
> restriction enzyme digests), but other than that would not add much utility.
> 
> Thoughts?
> 
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev