[Biopython] define circular DNA (?)

Peter Cherepanov p.cherepanov at imperial.ac.uk
Tue Mar 8 07:12:26 EST 2011


Ideally, it would be an object where the last letter is hard-linked to the first. For example, we should be able to define:

c = CircularSeq('ATGCGGGGA')

where:

c[1:9]  equals  ATGCGGGGA   (or, more awkwardly, c[0:9], if the original Python string numbering must be retained for some reason)
c[8:7]  equals  GAATGCGGG
c[1:1] equals A  (on a python string it is c[0:1]  =  A, of course)
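
A minimal sketch of this numbering on a plain Python string. The helper name circular_slice is hypothetical (it is not part of Biopython), and it assumes positions count from 1, both ends are included, and start > end means "run through the end and carry on from the first letter":

```python
def circular_slice(seq, start, end):
    """Return positions start..end (1-based, inclusive) of a circular sequence.

    Hypothetical helper: when start > end, the slice runs to the end of
    the string and continues from the first letter.
    """
    n = len(seq)
    if not (1 <= start <= n and 1 <= end <= n):
        raise IndexError("positions must lie between 1 and %d" % n)
    if start <= end:
        # Ordinary case: convert 1-based inclusive to Python's 0-based half-open.
        return seq[start - 1:end]
    # Wrapped case: tail of the string followed by its head.
    return seq[start - 1:] + seq[:end]


plasmid = "ATGCGGGGA"
print(circular_slice(plasmid, 1, 9))  # ATGCGGGGA
print(circular_slice(plasmid, 1, 1))  # A
print(circular_slice(plasmid, 9, 2))  # AAT (last letter, then the first two)
```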

Ideally, we would want to number such sequences from 1; after all, these are the kind of objects we deal with in biology.

And, most importantly of all, we must be able to call:
c.find('GGAATG') and get 7 back
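
Something along these lines is what I have in mind for the search. It is only a sketch on a plain string: circular_find is a made-up name, and returning 0 for "not found" is my own assumption (the 1-based counterpart of the -1 that str.find returns):

```python
def circular_find(seq, sub):
    """Return the 1-based start of sub in the circular seq, or 0 if absent.

    Hypothetical helper; the 0-for-missing convention is an assumption.
    """
    if not sub or len(sub) > len(seq):
        return 0
    # Append the first len(sub) - 1 letters so that matches spanning
    # the origin show up in an ordinary string search.
    extended = seq + seq[:len(sub) - 1]
    # str.find returns -1 when absent, so adding 1 yields 0.
    return extended.find(sub) + 1


plasmid = "ATGCGGGGA"
print(plasmid.find("GGAATG"))            # -1: a plain string misses the match
print(circular_find(plasmid, "GGAATG"))  # 7
```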

Peter




On 8 Mar 2011, at 10:48, Peter Cock wrote:

> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> <p.cherepanov at imperial.ac.uk> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such a thing as Biopython, which should have made
>> this task so much easier...
>> 
>> Is there a way to "enhance" the Seq object? (or maybe I do not know what
>> I am talking about...).
>> 
>> thanks a lot for responding!
>> 
>> with best wishes,
>> 
>> Peter
> 
> What I had in mind was a new class, CircularSeq, which would subclass
> the current Biopython Seq object, and still use a string internally for the
> sequence.
> 
> We could then modify the slice behaviour so that, perhaps, this would
> work by wrapping the origin:
> 
> c = CircularSeq('ACGTACGTACGT')
> assert len(c)==12
> print(c[10:14])
> 
> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
> 14 as wrapped to 2, returning the four bases GTAC.
> 
> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
> This means anyone (or more importantly, any code) expecting the
> string-like behaviour will get a nasty surprise (or a bug).
> 
> Another example, what about c[-2:]? For a plain string you'd
> get the last two letters. For a circular sequence you might think
> that should represent starting two before the origin, thus giving
> the last two letters plus the whole sequence? Also, c[-2:2] could
> mean the last two letters plus the first two letters, but for a
> plain Python string that returns an empty string.
> 
> Note that due to the way Python indexing works, single letter
> access is fine for negative indices, c[-2] would give the second
> last letter, 'G', which is consistent with wrapped counting back
> from the origin. We could also make c[14] wrap round to c[2] in
> this length 12 example (although there is a small risk of breaking
> code expecting an IndexError in this case).
> 
> There would be lots of other things to implement, like "in" and the
> find methods would need to check the substring across the origin.
> Then (for nucleotides), we'd need to ensure reverse_complement
> and complement also give a CircularSeq, likewise perhaps for the
> transcribe and back_transcribe. The translate method is particularly
> tricky as you can have an infinite reading frame, which might be
> represented as a circular protein sequence?
> 
> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?
> 
> Peter
> 
> P.S. Please CC the mailing list in your replies :)
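
The wrapped indexing discussed in the quoted message can be sketched roughly as follows. This is only one possible reading of the open questions raised there, not an existing Biopython class: to stay self-contained it wraps a plain str rather than subclassing Seq, keeps Python's 0-based half-open slices, returns ordinary strings from slices, and does not support a slice step.

```python
class CircularSeq:
    """Hypothetical circular sequence backed by a plain string (0-based)."""

    def __init__(self, data):
        self._data = str(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        n = len(self._data)
        if isinstance(index, slice):
            if index.step not in (None, 1):
                raise ValueError("step is not supported in this sketch")
            start = 0 if index.start is None else index.start
            stop = n if index.stop is None else index.stop
            # Walk from start up to stop, wrapping every position modulo n.
            return "".join(self._data[i % n] for i in range(start, stop))
        # Single letters wrap too, so c[14] is c[2] when n == 12.
        return self._data[index % n]

    def __contains__(self, sub):
        if len(sub) > len(self._data):
            return False
        # Extend with the first len(sub) - 1 letters to catch
        # matches that span the origin.
        extra = max(len(sub) - 1, 0)
        return sub in self._data + self._data[:extra]


c = CircularSeq("ACGTACGTACGT")
print("ACGTACGTACGT"[10:14])  # GT: a plain string stops at the end
print(c[10:14])               # GTAC
print(c[-2:2])                # GTAC
print(c[-2], c[14])           # G G
```

With these choices c[-2:] gives the last two letters followed by the whole sequence (14 letters), which is one of the interpretations mentioned above and exactly the kind of surprise for code expecting string behaviour.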



