[Biopython] define circular DNA (?)
Moritz Beber
moritz.beber at googlemail.com
Tue Mar 8 11:32:44 UTC 2011
On 03/08/2011 11:48 AM, Peter Cock wrote:
> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> <p.cherepanov at imperial.ac.uk> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such a thing as BioPython, which should have made
>> this task so much easier...
>>
>> Is there a way to "enhance" the Seq object? (or maybe I do not know what
>> I am talking about...).
>>
>> thanks a lot for responding!
>>
>> with best wishes,
>>
>> Peter
> What I had in mind was a new class, CircularSeq, which would subclass
> the current Biopython Seq object, and still use a string internally for the
> sequence.
>
> We could then modify the slice behaviour so that, perhaps, this would
> work by wrapping the origin:
>
> c = CircularSeq('ACGTACGTACGT')
> assert len(c)==12
> print c[10:14]
>
> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
> 14 as wrapped to 2, returning the four bases GTAC.
>
> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
> This means anyone (or more importantly, any code) expecting the
> string like behaviour will get a nasty surprise (or a bug).
>
> Another example, what about c[-2:]? For a plain string you'd
> get the last two letters. For a circular sequence you might think
> that should represent starting two before the origin, thus giving
> the last two letters plus the whole sequence? Also, c[-2:2] could
> mean the last two letters plus the first two letters, but for a
> plain python string that returns an empty string.
>
> Note that due to the way Python indexing works, single letter
> access is fine for negative indices, c[-2] would give the second
> last letter, 'G', which is consistent with wrapped counting back
> from the origin. We could also make c[14] wrap round to c[2] in
> this length 12 example (although there is a small risk of breaking
> code expecting an IndexError in this case).
>
> There would be lots of other things to implement, like "in" and the
> find methods would need to check the substring across the origin.
> Then (for nucleotides), we'd need to ensure reverse_complement
> and complement also give a CircularSeq, likewise perhaps for the
> transcribe and back_transcribe. The translate method is particularly
> tricky as you can have an infinite reading frame, which might be
> represented as a circular protein sequence?
>
> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?
>
> Peter
>
> P.S. Please CC the mailing list in your replies :)
If you just need circular behaviour in a small number of use cases, you
could consider wrapping the sequence in a cycle iterator (itertools.cycle):
http://docs.python.org/release/2.6/library/itertools.html?highlight=cycle#itertools.cycle
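
For example, here is a minimal sketch for sequences kept as plain Python
strings. The helper names circular_slice and circular_find are made up for
illustration and are not part of Biopython. The first one combines
itertools.cycle with itertools.islice to read a window past the origin; the
second one does not use cycle at all, it simply appends the start of the
sequence to its end so that str.find can see matches spanning the origin.

```python
from itertools import cycle, islice


def circular_slice(seq, start, stop):
    """Return the letters from start up to stop, wrapping past the origin.

    Hypothetical helper: assumes 0 <= start <= stop, where stop may be
    larger than len(seq).
    """
    if not 0 <= start <= stop:
        raise ValueError("expected 0 <= start <= stop")
    if not seq:
        return ""
    # cycle() repeats the letters endlessly; islice() cuts out the window.
    return "".join(islice(cycle(seq), start, stop))


def circular_find(seq, sub):
    """Return the lowest index where sub starts on the circle, or -1.

    Hypothetical helper: only looks at the forward strand and treats
    seq as a plain string.
    """
    if len(sub) > len(seq):
        return -1
    # Appending the first len(sub) - 1 letters exposes every match
    # that spans the origin, without reporting any index >= len(seq).
    return (seq + seq[:len(sub) - 1]).find(sub)
```

With the sequence from Peter's example, circular_slice('ACGTACGTACGT', 10, 14)
gives 'GTAC', i.e. the same as c[10:12] + c[0:2]. Note that islice steps
through the iterator from the beginning, so for very long sequences or
large offsets plain string slicing and concatenation will be faster.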