[Biopython] define circular DNA (?)

Tue Mar 8 13:58:03 UTC 2011

On Tue, Mar 8, 2011 at 1:28 PM, Leighton Pritchard
<Leighton.Pritchard at scri.ac.uk> wrote:
> I've got 2p hanging around, so...
>
> On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock"
> <p.j.a.cock at googlemail.com> wrote:
>>
>> What I had in mind was a new class, CircularSeq, which would subclass
>> the current Biopython Seq object, and still use a string internally for the
>> sequence.
>
> That seems sensible.  The main issue, as I see it, is that the physical
> object is naturally represented by a circularly-linked list, and we have for
> circular sequences an indexing/co-ordinate system with a defined zero
> start/end point (which is essentially arbitrary - though is usually the
> origin of replication for bacterial chromosomes).  This leads to a conflict
> between our natural expectations of Python indexing, and the meaning of the
> indexing on the physical object that's being represented.
>
> Whatever the ultimate implementation, there will either have to be a
> compromise between these two representations, or one or other view will be
> ignored.  There will inevitably be value judgements that someone is unhappy
> with ;)

Indeed.

>> We could then modify the slice behaviour so that, perhaps this would
>> by work wrapping the origin:
>>
>> c = CircularSeq('ACGTACGTACGT')
>> assert len(c)==12
>> print c[10:14]
>>
>> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
>> 14 as wrapped to 2, returning the four bases GTAC.
>
> That makes sense in Python indexing terms, but not in terms of the
> co-ordinate system for navigating the circular DNA.  To be consistent with
> location information from GenBank and other sources where features wrap the
> origin of circular DNA, we would need c[10:2] to return the same result as
> c[10:14].  That gives us potentially the same problem as c[-2:2], as it
> currently returns an empty string.  We'd have to modify Python
> slicing/indexing behaviour quite a bit to implement this 'naturally'.
>
> However, I don't think we should ignore the Python indexing format here,
> because we might want the ten bases after the base with co-ordinate 6 with
> c[6:6+10], which would give us a physically and conceptually sensible linear
> sequence that crosses the origin.

I think we agree that c[10:14] and c[10:10+4] should give the four bases
GTAC wrapping the origin when c is circular sequence ACGTACGTACGT,
equivalently c[10:12] + c[0:2] using Python slicing.

Likewise for your example c[6:6+10] or c[6:16] this should give six bases
wrapping the origin, equivalently c[6:12] + c[0:4] using Python slicing.

> We'd probably want to do the obvious things with modular arithmetic, so that
> we don't return, say, three concatenated linearised circular sequences to a
> request like c[0:36] or c[6:42].

I disagree, returning the three concatenated linearised circular sequences
is what I would expect. This is one of the debatable issues that will divide
people. Consider the (special and artificial) case of a circular plasmid with
an ORF wrapping round the origin (one, twice or infinite), the ORF sequence
is longer than the linearised plasmid, so slicing with concatenation would
be useful. e.g.

http://www.ncbi.nlm.nih.gov/pubmed/9740124
Perriman and Ares (1998), Circular mRNA can direct translation of
extremely long repeating-sequence proteins in vivo.

and:

http://dx.doi.org/10.1385/1-59259-280-5:069
Perriman (2002), Circular mRNA Encoding for Monomeric and
Polymeric Green Fluorescent Protein

(Very cool work)

>> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
>> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
>> This means anyone (or more importantly, any code) expecting the
>> string like behaviour will get a nasty surprise (or a bug).
>
> I'm not sure it's wise to constrain functionality and adequate
> representation of a (very important! - showing my bacterial bias) physical
> structure to maintain that level of consistency with String.  For instance,
> what would CircularSeq + Seq mean?  Physically, and conceptually, not a lot.
> So we might want to deprecate the __add__ method for this object - not
> typical String behaviour but, in my opinion, appropriate.

We're probably want to made addition of CircularSeq + Seq raise a
TypeError. Or, do a linearisation and simple addition with a warning?

> (You might remember that I was also generally not in favour of treating
> Seq objects as idealised Strings, so there's another bias for you ;) )

I recall :)

>> Note that due to the way Python indexing works, single letter
>> access is fine for negative indices, c[-2] would give the second
>> last letter, 'G', which is consistent with wrapped counting back
>> from the origin. We could also make c[14] wrap round to c[2] in
>> this length 12 example (although there is a small risk of breaking
>> code expecting an IndexError in this case).
>
> I wouldn't be in favour that behaviour in a general sense, though I don't
> see how to avoid it cleanly. I think it would be best to be strict with
> indexing to the co-ordinate system to avoid possible degeneracy of feature
> locations.  If we had a SNP at position 2, we could equally well associate
> it with any one of an infinite number of positions kl+2 where k is an
> integer and l is the sequence length, without modifying the computational
> result.

Yes, I was suggesting we could make c[x+n*length] act as c[x],
i.e. for *single* indexes which return one letter, apply the modulo
arithmetic. Or, we leave this to follow the current Python string
behaviour where if the index is equal to the length or more, you
get an IndexError. That avoids the ambiguity ;)

> I'm not keen on that kind of woolliness, but I think that it could
> possibly be avoided by modifying indexing to require at least one index that
> lies in the range [-l,l], and using modular arithmetic for slicing so that,
> for the example above, c[18:26] would not be treated as the valid slice
> c[6:14], but would instead throw an IndexError.

This depends on the treatment of things like c[0:36] or c[6:42]
discussed above (return 36 bases, or just 12?).

>> There would be lots of other things to implement, like "in" and the
>> find methods would need to check the substring across the origin.
>> Then (for nucleotides), we'd need to ensure reverse_complement
>> and complement also give a CircularSeq, likewise perhaps for the
>> transcribe and back_transcribe.
>
> Not to mention the other Biopython functions/methods that expect String-like
> indexing.  Maybe a cast (of sorts) between CircularSeq and Seq would be
> useful for that, though I can imagine great problems, there.

Having a toseq method like the MutableSeq does could handle that,
returning a traditional linear Seq object. If the CircularSeq 'breaks'
too much expected string-like behaviour that would be important.

>> The translate method is particularly
>> tricky as you can have an infinite reading frame, which might be
>> represented as a circular protein sequence?
>
> I would think that the test for that particular condition should be fairly
> straightforward (is there at least one stop codon in each of the six frames,
> taking into account the origin?).

Having thought about this example at length before, it can be done
but I don't think it is all that straightforward ;)

>> All in all, it is quite a lot of work, and there are several tricky bits
>> where the desired behaviour is not clear cut. Could we come up
>> with something useful or not?
>
> I think that there's every possibility of coming up with something useful -
> the question is to what degree it fits the Biopython/Python idiom, or 'looks
> like' the physical object, and whether it gets included in Biopython.
>
> L.

Agreed.

Peter