[BioPython] Adding startswith and endswith methods to the Seq object

Mon Apr 13 15:46:06 UTC 2009

On 13/04/2009 15:46, "Peter" <peter at maubp.freeserve.co.uk> wrote:

> On Mon, Apr 13, 2009 at 3:10 PM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> I'm not at all happy about the idea of supporting ambiguity characters
> in these string-like methods of the Seq object.  Right now I was
> proposing nothing special with ambiguity symbols, so:
> 
> >>> from Bio.Seq import Seq
> >>> Seq("TAN").startswith("TAN")
> True
> >>> Seq("TAA").startswith("TAN")
> False
> >>> Seq("TAA").startswith("TAX")
> False
> 
> I agree this doesn't cover all possible use cases, but it is very
> simple, and easy to explain.

That's in its favour, but I don't think that:

"Seq.startswith() behaves as expected for standard ambiguity symbols and
regular expression syntax if the Seq object is declared with either a
protein or nucleotide alphabet, but behaves like String.startswith()
otherwise" is either complicated, or hard to explain.

I think that the choice is one that is best made on whether the
functionality is useful like this, or better implemented in some other way.

On a design point, I'm not convinced that direct emulation of String methods
in Seq objects is *always* a Good Thing.  There are String methods that it
makes sense (to me) to emulate wholesale in Seq, such as .join(),
.swapcase(), .upper(), slicing behaviours and so on.  However, .title() and
.capitalize() seem a bit out of place.  Likewise, there are plenty of Seq
methods that don't have sensible String counterparts.  This is because,
conceptually, they represent different abstract concepts, and I don't think
that we should lose sight of that when making Seq objects behave like String
objects.  I think that abstract representation of sequences and provision of
useful functionality are the important points.

> Trying to support ambiguity symbols will be alphabet dependent
> (consider the above example could be a protein or DNA), and frankly
> extremely complicated.

I don't think it *has* to be complicated at all, though it could be if we
wanted it to be.

For example, avoiding ambiguity codes for now:

>>> from Bio.Seq import Seq
>>> Seq("TAG").startswith("TA[CG]")

Could be handled internally with re.match(), in the same way that

>>> Seq("TAG").startswith("TAG")

could be.  Seq.endswith() might be implementable by checking that an
re.search() call returns a match that stops at the end of the target
sequence, for example.  These methods would cover pretty much every
use case I can think of right now that doesn't involve an ambiguity symbol.
They wouldn't break String.startswith() behaviour for biological sequences,
because the special symbols have no place in the biological sequence
alphabets (except, perhaps, for gap characters).

Such an implementation could gain extra *useful* functionality in
.startswith() without breaking expected behaviour.  It would also leave us
in the same position originally proposed, that ambiguity symbols have no
meaning.
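
A minimal sketch of the idea, assuming plain Python strings stand in for
Seq objects.  The function names regex_startswith() and regex_endswith()
are hypothetical, not part of Biopython:

```python
import re


def regex_startswith(sequence, pattern):
    """Return True if sequence begins with a match for the regex pattern."""
    # re.match() only ever matches at the start of the string.
    return re.match(pattern, sequence) is not None


def regex_endswith(sequence, pattern):
    """Return True if a match for pattern finishes at the end of sequence."""
    # The non-capturing group keeps an alternation such as "TAA|TAG"
    # together, and \Z anchors the match to the very end of the string.
    return re.search(r"(?:%s)\Z" % pattern, sequence) is not None


print(regex_startswith("TAG", "TA[CG]"))   # regex prefix
print(regex_startswith("TAG", "TAG"))      # plain prefix, same as str
print(regex_endswith("ATGTAG", "TA[AG]"))  # regex suffix
```

One caveat with this sketch: any symbol that is also regex syntax, such as
the "*" often used for a stop codon in translated sequences, or "." and
"-" for gaps, would be read as regex rather than as a literal symbol.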

> It also breaks the "act like a string" idea.

I do not agree, because there's some elision in a lawyer's definition of
'act like', as opposed to 'act as' a string that comes into play ;)

If the Seq object acts *like* a string *when we expect it to*, then that
doesn't prevent us from having functionality more appropriate for a Seq
object, in addition to or instead of String behaviour.  We're already doing
this with the Seq.transcribe() and Seq.translate() (and no Seq.title())
methods, for example.  I don't see how this differs conceptually for
extending startswith() functionality, so long as it behaves like
String.startswith() *when we expect it to*.  The issue here is then: "when
is it reasonable to expect this string of symbols to behave like a raw
string, and when is it reasonable to expect it to behave like a
biological/regex sequence of symbols?".

> what would you expect the
> following to do, where we don't know if it is DNA or protein:
> >>> Seq("TAA").startswith("TAN")
> >>> Seq("TAA").startswith("TAX")
> We don't know, therefore we shouldn't guess, so I think these would
> have to raise an error.

That's one option, and likely the sanest - the error would probably provoke
the user into specifying an alphabet, at least.  However, there's no harm in
discussing other options, even if none of us like them...

If the sequence has an alphabet that specifies it as either Protein or
Nucleotide, then in those cases we can infer clearly what the ambiguity
symbol means, and there is no problem.  If the sequence does not have such
an alphabet, then we can potentially consider Seq.startswith() to behave
like String.startswith(), with an appropriate warning.

Alternatively, Seq.startswith() could behave like String.startswith() all
the time, unless passed with an optional argument (e.g. "ambiguity=True").
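
To make these options concrete, here is a rough sketch under some loud
assumptions: plain strings stand in for Seq objects, the "alphabet"
argument is a hypothetical stand-in for whatever alphabet the Seq was
declared with, and the tables hold the standard IUPAC ambiguity codes
written as regex character classes.  None of this is existing Biopython
API:

```python
import re
import warnings

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
DNA_AMBIGUITY = {
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# The RNA table is the same with U in place of T.
RNA_AMBIGUITY = {code: chars.replace("T", "U")
                 for code, chars in DNA_AMBIGUITY.items()}
# Protein codes: B is D or N, Z is E or Q, X is any standard residue.
PROTEIN_AMBIGUITY = {"B": "[DN]", "Z": "[EQ]",
                     "X": "[ACDEFGHIKLMNPQRSTVWY]"}
TABLES = {"DNA": DNA_AMBIGUITY, "RNA": RNA_AMBIGUITY,
          "protein": PROTEIN_AMBIGUITY}


def startswith(sequence, prefix, alphabet=None, ambiguity=False):
    """Hypothetical startswith() with opt-in ambiguity handling."""
    if not ambiguity:
        # Default: behave exactly like the string method.
        return sequence.startswith(prefix)
    if alphabet not in TABLES:
        # No usable alphabet: warn and fall back to string behaviour.
        warnings.warn("no protein or nucleotide alphabet; "
                      "comparing as plain strings")
        return sequence.startswith(prefix)
    table = TABLES[alphabet]
    # Expand ambiguity codes in the query; escape everything else.
    pattern = "".join(table.get(symbol, re.escape(symbol))
                      for symbol in prefix)
    return re.match(pattern, sequence) is not None


print(startswith("TAA", "TAN"))
print(startswith("TAA", "TAN", alphabet="DNA", ambiguity=True))
print(startswith("TAA", "TAX", alphabet="protein", ambiguity=True))
```

Only the query is expanded here, so a literal N in the sequence itself
would not match [ACGT].  Whether it should is one of the design questions
this sketch leaves open.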

Or maybe another optional argument could be passed to force the search to
treat a sequence without an alphabet as either "type='protein'" or
"type='RNA'", thereby suppressing the warning/error described above.

> This also applies to the other ambiguous
> nucleotide letters, like S for G or C in nucleotide sequences.  Then
> there are more alphabet corner cases - consider reduced alphabets
> (e.g. simplified protein sequences mapping all acidic residues to a
> single character etc).

...and what if the user makes up their own alphabet?  And so on... ;)

Those would be neither Protein nor DNA/RNA alphabets, and so could do
whatever the default is for Seq.startswith() behaviour in those
circumstances.

Another alternative could be to have an optional argument defining the
ambiguity symbols, and what they represent (e.g.
"ambiguity_table={'N':'[ACGT]', 'P':'[QRST]'}").

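A user-supplied table could be sketched along these lines, again with
plain strings standing in for Seq objects.  The function name
startswith_with_table() is hypothetical, and the 'P' entry is just the
made-up example symbol from above:

```python
import re


def startswith_with_table(sequence, prefix, ambiguity_table=None):
    """Hypothetical startswith() taking a caller-defined ambiguity table."""
    table = ambiguity_table or {}
    # Symbols found in the table expand to their character class;
    # every other symbol is escaped and matched literally.
    pattern = "".join(table.get(symbol, re.escape(symbol))
                      for symbol in prefix)
    return re.match(pattern, sequence) is not None


table = {"N": "[ACGT]", "P": "[QRST]"}
print(startswith_with_table("TAA", "TAN", table))  # N expands to [ACGT]
print(startswith_with_table("TAA", "TAN"))         # no table: literal N
```

With no table at all this reduces to ordinary string behaviour, so a
made-up alphabet costs nothing unless the caller asks for it.
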
> Several "Zen of Python" points spring to mind, including "If the
> implementation is hard to explain, it's a bad idea.", but in summary I'm
> against supporting ambiguous characters in the string-like methods of
> the Seq object (so: find, rfind, split, startswith, endswith, etc).
> We should handle this another way.

If the natural home for this functionality is Bio.Motif, then the natural
home for it is Bio.Motif, and I don't have a problem with that.  I'm happy
to go with the consensus.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
