[BioPython] Adding startswith and endswith methods to the Seq object

Mon Apr 13 14:46:31 UTC 2009

On Mon, Apr 13, 2009 at 3:10 PM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
>
> Howdo,
>
>> Does this seem like a sensible addition to the Seq object?  It is
>> consistent with making the Seq object more like a python string.
>
> Yes it does seem sensible.

Good :)

> I'd quite like to (eventually) have the capability either to provide ambiguity
> symbols, or to query with a regular expression along the lines of
> re.match() (or maybe the nonexistent re.endmatch()).
>
> Since this isn't implemented yet, maybe there's still time to consider this
> potential usage in the implementation?

I'm not at all happy about the idea of supporting ambiguity characters
in these string-like methods of the Seq object.  What I was proposing
treats ambiguity symbols like any other character, so:

>>> from Bio.Seq import Seq
>>> Seq("TAN").startswith("TAN")
True
>>> Seq("TAA").startswith("TAN")
False
>>> Seq("TAA").startswith("TAX")
False

I agree this doesn't cover all possible use cases, but it is very
simple, and easy to explain.

Supporting ambiguity symbols would be alphabet dependent (the example
above could be either protein or DNA), and frankly extremely
complicated.  It also breaks the "act like a string" idea.
Essentially you'd be asking for the following:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> Seq("TAN", generic_dna).startswith("TAN")
True
>>> Seq("TAA", generic_dna).startswith("TAN") #treat N specially
True
>>> Seq("TAN", generic_protein).startswith("TAN")
True
>>> Seq("TAA", generic_protein).startswith("TAN") #protein, so N is a normal amino acid
False
>>> Seq("TAX", generic_protein).startswith("TAX")
True
>>> Seq("TAA", generic_protein).startswith("TAX") #treat X specially
True

So far that is at least understandable - but what would you expect the
following to do, where we don't know if it is DNA or protein:

>>> Seq("TAA").startswith("TAN")
>>> Seq("TAA").startswith("TAX")

We don't know, therefore we shouldn't guess, so I think these would
have to raise an error.  This also applies to the other ambiguous
nucleotide letters, like S for G or C in nucleotide sequences.  Then
there are more alphabet corner cases - consider reduced alphabets
(e.g. simplified protein sequences mapping all acidic residues to a
single character etc).
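
To make the complexity concrete, here is a rough sketch of what such an
alphabet-dependent check might look like.  Everything in it is
hypothetical: ambiguous_startswith is not a Biopython function, the
plain strings "dna" and "protein" merely stand in for real alphabet
objects, and the ambiguity tables are deliberately incomplete.

```python
# Hypothetical helper, NOT part of Biopython: a sketch of an
# alphabet-dependent prefix check, assuming the caller names the alphabet.

# Illustrative subset of the IUPAC nucleotide ambiguity codes.
DNA_AMBIGUITY = {"N": "ACGT", "S": "CG", "W": "AT", "R": "AG", "Y": "CT"}
# For proteins only X is treated specially here; None means "any letter".
PROTEIN_AMBIGUITY = {"X": None}


def ambiguous_startswith(seq, prefix, alphabet=None):
    """Return True if seq starts with prefix, expanding ambiguity codes."""
    if alphabet == "dna":
        table = DNA_AMBIGUITY
    elif alphabet == "protein":
        table = PROTEIN_AMBIGUITY
    else:
        # We don't know what the letters mean, so refuse to guess.
        raise ValueError("alphabet unknown; cannot interpret ambiguity codes")
    if len(prefix) > len(seq):
        return False
    for s, p in zip(seq, prefix):
        if p in table:
            allowed = table[p]
            # An ambiguity code matches itself or any letter it stands for.
            if allowed is not None and s != p and s not in allowed:
                return False
        elif s != p:
            return False
    return True


print(ambiguous_startswith("TAA", "TAN", "dna"))      # N treated specially
print(ambiguous_startswith("TAA", "TAN", "protein"))  # N is an ordinary residue
```

Even this toy version needs a lookup table per alphabet, a decision
about sequences with no known alphabet, and a rule for ambiguity codes
in the target sequence itself, none of which a plain string method has
to worry about.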

Several "Zen of Python" points spring to mind, including "If the
implementation is hard to explain, it's a bad idea.", but in summary I
am against supporting ambiguous characters in the string-like methods
of the Seq object (so: find, rfind, split, startswith, endswith, etc).
We should handle this another way.

Bartek: would Bio.Motif give us a nice way to do these kinds of
searches?  For example, given a simple nucleotide motif of "TAN"
(which should match "TAA", "TAC", "TAG" or "TAT") or "TAS" (which
should match "TAC" or "TAG"), can we check if this matches at the
start of a target nucleotide sequence?  And similarly for protein
motifs (e.g. signal peptides).
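
In the meantime, the nucleotide case can be approximated outside the
Seq object with the standard re module, along the lines Leighton
suggested.  This is only a sketch: the function names are made up for
illustration, it is not how Bio.Motif works, and it assumes an
unambiguous DNA target sequence given as (or convertible to) a string.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Turn an IUPAC motif such as 'TAS' into a regex such as '[T][A][CG]'."""
    # Raises KeyError for letters outside the IUPAC nucleotide codes.
    return "".join("[%s]" % IUPAC_DNA[letter] for letter in motif.upper())


def motif_matches_start(motif, sequence):
    """True if the motif matches at the start of the sequence."""
    return re.match(motif_to_regex(motif), str(sequence).upper()) is not None


def motif_matches_end(motif, sequence):
    """True if the motif matches at the very end of the sequence."""
    pattern = motif_to_regex(motif) + r"\Z"
    return re.search(pattern, str(sequence).upper()) is not None


print(motif_matches_start("TAN", "TAAGGG"))
print(motif_matches_start("TAS", "TAAGGG"))
print(motif_matches_end("TAS", "GGGTAC"))
```

This keeps the ambiguity handling explicit and clearly tied to one
alphabet, while startswith and endswith stay simple.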

Peter