[Biopython] define circular DNA (?)

Tue Mar 8 08:28:11 EST 2011

I've got 2p hanging around, so...

On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock"
<p.j.a.cock at googlemail.com> wrote:

> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> <p.cherepanov at imperial.ac.uk> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such thing as BioPython, which should have made
>> this task so much easier...
>>
>> Is there a way to "enhance" the Seq object? (or may be I do not know what
>> I am talking about...).
>>
>> thanks a lot for responding!
>>
>> with best wishes,
>>
>> Peter
>
> What I had in mind was a new class, CircularSeq, which would subclass
> the current Biopython Seq object, and still use a string internally for the
> sequence.

That seems sensible.  The main issue, as I see it, is that the physical
object is naturally represented by a circularly-linked list, and we have for
circular sequences an indexing/co-ordinate system with a defined zero
start/end point (which is essentially arbitrary - though is usually the
origin of replication for bacterial chromosomes).  This leads to a conflict
between our natural expectations of Python indexing, and the meaning of the
indexing on the physical object that's being represented.

Whatever the ultimate implementation, there will either have to be a
compromise between these two representations, or one or other view will be
ignored.  There will inevitably be value judgements that someone is unhappy
with ;)

> We could then modify the slice behaviour so that, perhaps this would
> by work wrapping the origin:
>
> c = CircularSeq('ACGTACGTACGT')
> assert len(c)==12
> print c[10:14]
>
> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
> 14 as wrapped to 2, returning the four bases GTAC.

That makes sense in Python indexing terms, but not in terms of the
co-ordinate system for navigating the circular DNA.  To be consistent with
location information from GenBank and other sources where features wrap the
origin of circular DNA, we would need c[10:2] to return the same result as
c[10:14].  That gives us potentially the same problem as c[-2:2], as it
currently returns an empty string.  We'd have to modify Python
slicing/indexing behaviour quite a bit to implement this 'naturally'.

However, I don't think we should ignore the Python indexing format here,
because we might want the ten bases after the base with co-ordinate 6 with
c[6:6+10], which would give us a physically and conceptually sensible linear
sequence that crosses the origin.

We'd probably want to do the obvious things with modular arithmetic, so that
we don't return, say, three concatenated linearised circular sequences to a
request like c[0:36] or c[6:42].

> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
> This means anyone (or more importantly, any code) expecting the
> string like behaviour will get a nasty surprise (or a bug).

I'm not sure it's wise to constrain functionality and adequate
representation of a (very important! - showing my bacterial bias) physical
structure to maintain that level of consistency with String.  For instance,
what would CircularSeq + Seq mean?  Physically, and conceptually, not a lot.
So we might want to deprecate the __add__ method for this object - not
typical String behaviour but, in my opinion, appropriate.

(You might remember that I was also generally not in favour of treating Seq
objects as idealised Strings, so there's another bias for you ;) )

> Note that due to the way Python indexing works, single letter
> access is fine for negative indices, c[-2] would give the second
> last letter, 'G', which is consistent with wrapped counting back
> from the origin. We could also make c[14] wrap round to c[2] in
> this length 12 example (although there is a small risk of breaking
> code expecting an IndexError in this case).

I wouldn't be in favour that behaviour in a general sense, though I don't
see how to avoid it cleanly.  I think it would be best to be strict with
indexing to the co-ordinate system to avoid possible degeneracy of feature
locations.  If we had a SNP at position 2, we could equally well associate
it with any one of an infinite number of positions kl+2 where k is an
integer and l is the sequence length, without modifying the computational
result.  I'm not keen on that kind of woolliness, but I think that it could
possibly be avoided by modifying indexing to require at least one index that
lies in the range [-l,l], and using modular arithmetic for slicing so that,
for the example above, c[18:26] would not be treated as the valid slice
c[6:14], but would instead throw an IndexError.

> There would be lots of other things to implement, like "in" and the
> find methods would need to check the substring across the origin.
> Then (for nucleotides), we'd need to ensure reverse_complement
> and complement also give a CircularSeq, likewise perhaps for the
> transcribe and back_transcribe.

Not to mention the other Biopython functions/methods that expect String-like
indexing.  Maybe a cast (of sorts) between CircularSeq and Seq would be
useful for that, though I can imagine great problems, there.

> The translate method is particularly
> tricky as you can have an infinite reading frame, which might be
> represented as a circular protein sequence?

I would think that the test for that particular condition should be fairly
straightforward (is there at least one stop codon in each of the six frames,
taking into account the origin?).

> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?

I think that there's every possibility of coming up with something useful -
the question is to what degree it fits the Biopython/Python idiom, or 'looks
like' the physical object, and whether it gets included in Biopython.

L.

--
Dr Leighton Pritchard MRSC
Plant Pathology Programme, SCRI (C block)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel: No telephone during office refurbishment

[The James Hutton Institute logo]
Please note that from 1 April 2011, SCRI and the Macaulay Land Use Research Institute will join to become The James Hutton Institute.

______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.

DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.  This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________