[Biopython] define circular DNA (?)

Tue Mar 8 10:48:13 UTC 2011

On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
<p.cherepanov at imperial.ac.uk> wrote:
> I suppose if a DNA sequence is kept as a simple Python string, there is
> no easy way to have it "circular". I am a beginner in Python (I use it only
> occasionally, to solve very specific and simple-minded tasks, when manual
> match/cut-and-paste operations become too much of a burden). Having
> spent an extra hour to hack out and debug a piece of code to match/extract
> to/from circular plasmid sequences kept as Python strings, I thought: hey,
> wait a minute, there is such a thing as BioPython, which should have made
> this task so much easier...
>
> Is there a way to "enhance" the Seq object? (or maybe I do not know what
> I am talking about...).
>
> thanks a lot for responding!
>
> with best wishes,
>
> Peter

What I had in mind was a new class, CircularSeq, which would subclass
the current Biopython Seq object, and still use a string internally for the
sequence.

We could then modify the slice behaviour so that, perhaps, this would
work by wrapping around the origin:

# CircularSeq is the proposed new class, it does not exist in Biopython yet
c = CircularSeq('ACGTACGTACGT')
assert len(c) == 12
print(c[10:14])

It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
14 as wrapped to 2, returning the four bases GTAC.
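
To make the idea concrete, here is a rough sketch of that wrapped slice
as a stand-alone function on a plain string. The name circular_slice is
made up for illustration, it is not part of Biopython, and it only
handles the simple case 0 <= start <= end:

```python
def circular_slice(seq, start, end):
    """Slice seq from start to end, letting end run past the origin.

    Hypothetical helper; assumes 0 <= start <= end.
    """
    n = len(seq)
    if n == 0 or end <= start:
        return ""
    # Walk forward from start and wrap every position with modulo.
    return "".join(seq[i % n] for i in range(start, end))


plasmid = "ACGTACGTACGT"
wrapped = circular_slice(plasmid, 10, 14)
print(wrapped)  # same letters as plasmid[10:12] + plasmid[0:2]
```

Slices that stay inside the sequence come out exactly as they would
from the plain string, so only the past-the-end case changes.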

Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
same as 'ACGTACGTACGT'[10:] which is the last two letters only.
This means anyone (or more importantly, any code) expecting
string-like behaviour will get a nasty surprise (or a bug).

As another example, what about c[-2:]? For a plain string you'd
get the last two letters. For a circular sequence you might think
that should mean starting two before the origin, thus giving
the last two letters plus the whole sequence? Also, c[-2:2] could
mean the last two letters plus the first two letters, but for a
plain Python string that returns an empty string.
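
For reference, this is what a plain Python string does with those
slices today, next to one possible circular reading of c[-2:2]. The
variable names are just for illustration:

```python
s = "ACGTACGTACGT"

tail = s[-2:]       # the last two letters
across = s[-2:2]    # -2 is read as index 10, so this is s[10:2]
clipped = s[10:14]  # the end index is silently clipped to len(s)

# One possible circular reading of [-2:2]: last two plus first two.
one_reading = s[-2:] + s[:2]

print(repr(tail), repr(across), repr(clipped), repr(one_reading))
```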

Note that due to the way Python indexing works, single letter
access is fine for negative indices, c[-2] would give the second
last letter, 'G', which is consistent with wrapped counting back
from the origin. We could also make c[14] wrap round to c[2] in
this length 12 example (although there is a small risk of breaking
code expecting an IndexError in this case).
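
Wrapping a single index is the easy part. A minimal sketch, using a
made-up helper name (circular_getitem) on a plain string rather than
a real Seq subclass:

```python
def circular_getitem(seq, index):
    """Return the letter at index, wrapping round the origin.

    Hypothetical helper; an empty sequence still raises IndexError.
    """
    if len(seq) == 0:
        raise IndexError("empty sequence")
    # Python's % always gives a result in range(len(seq)) here,
    # so negative and too-large indices both wrap.
    return seq[index % len(seq)]


s = "ACGTACGTACGT"
print(circular_getitem(s, 14))  # same letter as s[2]
print(circular_getitem(s, -2))  # same letter as s[-2]
```

With a plain string s[14] raises IndexError, which is exactly the
behaviour some existing code might be relying on.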

There would be lots of other things to implement: "in" and the find
methods would need to check for substrings spanning the origin.
Then (for nucleotides), we'd need to ensure reverse_complement
and complement also give a CircularSeq, and likewise perhaps for
transcribe and back_transcribe. The translate method is particularly
tricky as you can have an infinite reading frame, which might be
represented as a circular protein sequence?
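
As an illustration of the "find" case only, one simple approach is to
search a copy of the sequence extended past the origin. This is a
sketch on plain strings with a made-up name (circular_find) and a
made-up example sequence, and it assumes a match may not be longer
than the sequence itself:

```python
def circular_find(seq, sub):
    """Return the lowest start index of sub in circular seq, or -1.

    Hypothetical helper; matches longer than seq are not allowed.
    """
    if len(sub) > len(seq):
        return -1
    # Append just enough of the start so that a match beginning at
    # any position before the origin can run across it.
    extended = seq + seq[:max(len(sub) - 1, 0)]
    return extended.find(sub)


plasmid = "AAACCCGGGTTT"
print(plasmid.find("TTAA"))            # plain string search
print(circular_find(plasmid, "TTAA"))  # search across the origin
print(circular_find(plasmid, "CCC"))   # ordinary internal match
```

An "in" test could then just check that the result is not -1.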

All in all, it is quite a lot of work, and there are several tricky bits
where the desired behaviour is not clear-cut. Could we come up
with something useful or not?

Peter

P.S. Please CC the mailing list in your replies :)