[Biopython-dev] Circular sequences

Peter Cock p.j.a.cock at googlemail.com
Wed Jan 16 10:24:13 UTC 2013


For those who missed it last time, I think the most recent in-depth
discussion about circular sequences and slicing was here:

http://lists.open-bio.org/pipermail/biopython/2011-March/007075.html
...
http://lists.open-bio.org/pipermail/biopython/2011-March/007085.html

On Wed, Jan 16, 2013 at 9:42 AM, Markus Piotrowski
<Markus.Piotrowski at ruhr-uni-bochum.de> wrote:
> On 15.01.2013 22:45, Antony Lee wrote:
>
>> needed to y to make it bigger than x, and stop there).  Slicing with one
>> or both ends undefined (ie. s[:], s[x:], s[:y]) raises an IndexError
>> (because, well, I read s[x:] as "return the elements of s starting from
>> the x'th until the end"... but there is no such end.).  (A second option
>> would be to return an infinite iterable for s[x:], but that doesn't take
>> care of s[:y] anyways, not to mention the bugs that may appear from
>> that.)
>
>
> Another possibility, which makes some biological sense (thinking of
> restriction digests), would be that
> s[x:] (or s[:y]) returns a linear sequence starting at x and ending with x-1
> (or ending with y and starting at y+1). Thus, s[x:] would mean 'cut my
> circle at x and return the linear sequence starting at x'.

That's exactly the kind of behaviour which would make me nervous,
given that in general the Biopython sequence objects mimic Python strings.
There are many examples where that 'extra' sequence would be
unexpected. For instance, writing out line-wrapped sequence data.
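
As a rough illustration of the line wrapping case (a sketch using a
plain Python string and a hypothetical helper called wrap, not any
existing Biopython function), a typical loop relies on the final slice
simply coming back short:

```python
def wrap(seq, width=60):
    """Split seq into chunks of at most `width` letters (hypothetical helper)."""
    # With string semantics, seq[i:i + width] is truncated at the end of
    # the sequence, so the last line is just shorter than the others.
    return [seq[i:i + width] for i in range(0, len(seq), width)]


lines = wrap("ACGT" * 4, width=10)
for line in lines:
    print(line)
```

If a circular object instead wrapped round the origin for that last
slice, the final line would pick up letters which were already written
at the start of the output.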

I would prefer an explicit method like 'cut' on a circular sequence
object returning a full length linear sequence. Similarly a 'roll' or
'rotate' method could shift the origin to a new coordinate.
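
To make that concrete, here is a minimal sketch. The class name
CircularSeq and the exact method signatures are assumptions for
illustration only, nothing like this exists in Biopython at the
moment:

```python
class CircularSeq:
    """Hypothetical circular sequence wrapper (not part of Biopython)."""

    def __init__(self, data):
        self.data = str(data)

    def __len__(self):
        return len(self.data)

    def rotate(self, origin):
        """Return a new circular sequence with its origin moved to `origin`."""
        n = len(self.data)
        if n == 0:
            return CircularSeq("")
        k = origin % n  # allow negative or over-long coordinates
        return CircularSeq(self.data[k:] + self.data[:k])

    def cut(self, position):
        """Open the circle at `position`, returning a full length linear string."""
        return self.rotate(position).data


plasmid = CircularSeq("GAATTCAAAT")
linear = plasmid.cut(3)         # linear copy starting at index 3
shifted = plasmid.rotate(6)     # still circular, new origin
print(linear)
print(shifted.data)
```

Both methods are explicit, so ordinary slicing can keep its usual
string-like meaning.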

One simple solution to the complexities of the slice behaviour is
the practical one: they act like Python strings. Basically all we
would be adding would be an 'is circular' flag and some logic about
how to propagate that flag in operations like addition and slicing.
If we went that route it might still be possible to make the find and
'in' functionality origin aware... but that may just cause trouble.
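
A sketch of what the flag approach might look like follows. The class
name FlaggedSeq, the attribute name circular, and in particular the
propagation rules are all assumptions for the sake of discussion, not
a settled design:

```python
class FlaggedSeq:
    """Hypothetical string-like sequence carrying an 'is circular' flag."""

    def __init__(self, data, circular=False):
        self.data = str(data)
        self.circular = circular

    def __str__(self):
        return self.data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            sub = self.data[index]  # plain string slicing, no wrap-around
            # Assumed rule: only a full length, forward copy stays circular.
            whole = len(sub) == len(self.data) and index.step in (None, 1)
            return FlaggedSeq(sub, circular=self.circular and whole)
        return self.data[index]

    def __add__(self, other):
        # Assumed rule: the result of concatenation is always linear.
        return FlaggedSeq(self.data + str(other), circular=False)


s = FlaggedSeq("ACGTAC", circular=True)
print(s[:].circular)          # full copy
print(s[1:4].circular)        # fragment
print((s + "GG").circular)    # concatenation
```

Other rules are possible, for example treating a full length reverse
slice as circular too.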

This would solve where to store whether a sequence is circular (e.g.
when reading GenBank and EMBL files - or for handling restriction
enzyme digests), but other than that it would not add much utility.

Thoughts?

Peter


