[Biopython] define circular DNA (?)
Peter Cherepanov
p.cherepanov at imperial.ac.uk
Tue Mar 8 07:12:26 EST 2011
Ideally, it would be an object where the last letter is hard-linked to the first. For example, we should be able to define:
c = CircularSeq('ATGCGGGGA')
where:
c[1:9] equals ATGCGGGGA (or, more awkwardly, c[0:9], if the original Python string numbering must be retained for some reason)
c[8:7] equals GAATGCGGG
c[1:1] equals A (on a Python string it is c[0:1] = A, of course)
Ideally, we would want to number such sequences from 1; after all, these are the kind of objects we deal with in biology.
And, most importantly of all, it must be possible for:
c.find('GGAATG') to return "7"
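
To make the numbering above concrete, here is a minimal sketch on plain Python strings (Python 3). The names circular_slice and circular_find are made up for illustration and are not part of Biopython; positions are 1-based and inclusive, as in the examples, and returning 0 for "not found" is my own assumption.

```python
def circular_slice(seq, start, end):
    """Return positions start..end (1-based, inclusive), wrapping past the end."""
    n = len(seq)
    if not (1 <= start <= n and 1 <= end <= n):
        raise IndexError("positions must be between 1 and len(seq)")
    i, j = start - 1, end  # convert to Python's 0-based, half-open form
    if i < j:
        return seq[i:j]
    return seq[i:] + seq[:j]  # the range crosses the origin


def circular_find(seq, sub):
    """Return the 1-based position of sub on the circle, or 0 if absent."""
    if not sub or len(sub) > len(seq):
        return 0
    # Appending the first len(sub) - 1 letters exposes matches that span the origin.
    extended = seq + seq[:len(sub) - 1]
    return extended.find(sub) + 1  # str.find gives -1 when absent, hence 0


plasmid = "ATGCGGGGA"
print(circular_slice(plasmid, 1, 9))     # ATGCGGGGA
print(circular_slice(plasmid, 8, 7))     # GAATGCGGG
print(circular_slice(plasmid, 1, 1))     # A
print(circular_find(plasmid, "GGAATG"))  # 7
```

Walking the nine letters from position 8 round to position 7 gives GAATGCGGG, i.e. the whole circle starting at the second-to-last letter.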
Peter
On 8 Mar 2011, at 10:48, Peter Cock wrote:
> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> <p.cherepanov at imperial.ac.uk> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such a thing as BioPython, which should have made
>> this task so much easier...
>>
>> Is there a way to "enhance" the Seq object? (or maybe I do not know what
>> I am talking about...).
>>
>> thanks a lot for responding!
>>
>> with best wishes,
>>
>> Peter
>
> What I had in mind was a new class, CircularSeq, which would subclass
> the current Biopython Seq object, and still use a string internally for the
> sequence.
>
> We could then modify the slice behaviour so that, perhaps, this would
> work by wrapping the origin:
>
> c = CircularSeq('ACGTACGTACGT')
> assert len(c)==12
> print c[10:14]
>
> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
> 14 as wrapped to 2, returning the four bases GTAC.
>
> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
> This means anyone (or more importantly, any code) expecting the
> string-like behaviour will get a nasty surprise (or a bug).
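
The difference between the two behaviours can be seen side by side in a small sketch (Python 3). The helper wrapped_slice is a hypothetical stand-in for what a CircularSeq slice might do; it works on a plain string and is not an existing Biopython function.

```python
def wrapped_slice(text, start, stop):
    """0-based, half-open slice where a stop beyond len(text) wraps to the start."""
    n = len(text)
    if not 0 <= start < n:
        raise IndexError("start must lie within the sequence")
    if stop < start or stop - start > n:
        raise ValueError("this sketch only walks forward, at most once round")
    # Index modulo the length so positions past the end continue from the origin.
    return "".join(text[i % n] for i in range(start, stop))


plain = "ACGTACGTACGT"
print(plain[10:14])                  # GT   (a plain string silently clamps 14 to 12)
print(wrapped_slice(plain, 10, 14))  # GTAC (14 treated as wrapped to 2)
```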
>
> Another example, what about c[-2:]? For a plain string you'd
> get the last two letters. For a circular sequence you might think
> that should represent starting two before the origin, thus giving
> the last two letters plus the whole sequence? Also, c[-2:2] could
> mean the last two letters plus the first two letters, but for a
> plain Python string that returns an empty string.
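
As a sketch of just one possible reading of these cases (Python 3): treat a negative start as "that many letters before the origin" and a non-negative stop as "that many letters after it". The function name origin_relative_slice is invented here, and whether this is the right interpretation is exactly the open question.

```python
def origin_relative_slice(text, start, stop):
    """Letters from |start| before the origin up to stop after it (one reading only)."""
    n = len(text)
    if not -n <= start < 0:
        raise ValueError("this sketch expects a negative start within one lap")
    if not 0 <= stop <= n:
        raise ValueError("stop must be between 0 and len(text)")
    return text[start:] + text[:stop]


plain = "ACGTACGTACGT"
print(repr(plain[-2:2]))                      # '' on a plain string
print(origin_relative_slice(plain, -2, 2))    # GTAC
print(origin_relative_slice(plain, -2, 12))   # GTACGTACGTACGT (last two plus everything)
```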
>
> Note that due to the way Python indexing works, single letter
> access is fine for negative indices, c[-2] would give the second
> last letter, 'G', which is consistent with wrapped counting back
> from the origin. We could also make c[14] wrap round to c[2] in
> this length 12 example (although there is a small risk of breaking
> code expecting an IndexError in this case).
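
Single-letter wrapping can be sketched with a modulo on the index (Python 3). WrappedIndex is a made-up toy class holding a plain string, not the proposed CircularSeq and not a Seq subclass.

```python
class WrappedIndex:
    """Toy container whose integer indices wrap round the sequence length."""

    def __init__(self, data):
        if not data:
            raise ValueError("an empty sequence cannot be wrapped")
        self._data = str(data)

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        # Explicit iterator: without it, Python falls back to calling
        # __getitem__ with 0, 1, 2, ... until IndexError, which never comes.
        return iter(self._data)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch handles single integer positions only")
        return self._data[index % len(self._data)]


c = WrappedIndex("ACGTACGTACGT")
print(c[-2], c[2], c[14])  # G G G
```

The __iter__ method is one concrete instance of the IndexError risk: a plain "for letter in c" loop would otherwise run forever.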
>
> There would be lots of other things to implement, like "in" and the
> find methods would need to check the substring across the origin.
> Then (for nucleotides), we'd need to ensure reverse_complement
> and complement also give a CircularSeq, likewise perhaps for the
> transcribe and back_transcribe. The translate method is particularly
> tricky as you can have an infinite reading frame, which might be
> represented as a circular protein sequence?
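
To illustrate the reading-frame point (Python 3): reading codons round a circle of length n returns to the starting frame after n / gcd(n, 3) codons, so a length that is not a multiple of 3 takes three laps before it repeats. The function circular_codons is a hypothetical helper; it ignores stop codons and does no translation.

```python
from math import gcd


def circular_codons(seq, start=0):
    """List the codons read round the circle from start until the frame repeats."""
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    count = n // gcd(n, 3)  # codons in one full cycle of the reading frame
    return [
        "".join(seq[(start + 3 * k + offset) % n] for offset in range(3))
        for k in range(count)
    ]


print(circular_codons("ATGCGGGGA"))  # ['ATG', 'CGG', 'GGA']
print(circular_codons("ACGT"))       # ['ACG', 'TAC', 'GTA', 'CGT']
```

The resulting codon list is itself cyclic, which is one way to picture the "circular protein sequence" idea.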
>
> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?
>
> Peter
>
> P.S. Please CC the mailing list in your replies :)