[BioPython] Adding startswith and endswith methods to the Seq object

Peter biopython at maubp.freeserve.co.uk
Mon Apr 13 17:04:41 UTC 2009


On Mon, Apr 13, 2009 at 4:46 PM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> However, there's no harm in discussing other options, even if none of
> us like them...
>
> If the sequence has an alphabet that specifies it as either Protein or
> Nucleotide, then in those cases we can infer clearly what the ambiguity
> symbol means, and there is no problem.

Strictly speaking, only if the sequence has an (ambiguous) IUPAC
alphabet can we know what the (ambiguity) symbols mean with certainty.
If the sequence has only a generic DNA/RNA/Nucleotide/Protein
alphabet then we can only make a pretty good guess.

> Alternatively, Seq.startswith() could behave like String.startswith() all
> the time, unless passed with an optional argument (e.g. "ambiguity=True").

That idea could work.  The default behaviour would be "act like a
string", but an optional argument to
startswith/endswith/find/rfind/count/... could enable ambiguity
matching (provided the sequence has a suitable alphabet).  This would
be backwards compatible, and allow us to forge ahead with adding
simple string-like startswith/endswith methods now (which are useful
as is, and so far everyone seems supportive of), and implement
ambiguity support later.
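
As a rough illustration of the optional argument idea, here is a minimal
sketch written as a plain function on Python strings rather than as a Seq
method. The function name, the "table" argument, and the rule that two
symbols match when their sets of possible bases overlap are all assumptions
made for this example, not a proposed Biopython API. The table holds the
standard IUPAC DNA ambiguity codes.

```python
# Hypothetical sketch only: plain strings stand in for Seq objects.
# Standard IUPAC DNA ambiguity codes, mapped to the bases they can represent.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def startswith(sequence, prefix, ambiguity=False, table=IUPAC_DNA):
    """Return True if sequence starts with prefix.

    By default this acts exactly like str.startswith. With ambiguity=True,
    two symbols are treated as matching when the sets of bases they can
    represent overlap (an assumed rule; other definitions are possible).
    Symbols missing from the table raise KeyError.
    """
    if not ambiguity:
        return sequence.startswith(prefix)
    if len(prefix) > len(sequence):
        return False
    return all(
        set(table[s]) & set(table[p]) for s, p in zip(sequence, prefix)
    )
```

With this sketch, startswith("ACGT", "AN") keeps the plain string behaviour,
while startswith("ACGT", "AN", ambiguity=True) lets N stand for any base.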

> Or maybe another optional argument could be passed to force the search to
> treat a sequence without an alphabet as either "type='protein'" or
> "type='RNA'", thereby suppressing the warning/error described above.
> ...
> Another alternative could be to have an optional argument defining the
> ambiguity symbols, and what they represent (e.g.
> "ambiguity_table={'N':'[ACGT]', 'P':'[QRST]'}").

If we go down the optional argument route (e.g. ambiguity=True), then
a way of specifying the sequence type or ambiguity characters might be
possible, although I'd prefer to encourage more rigorous use of
alphabets in Seq objects in the first place (see also enhancement Bug
2597, http://bugzilla.open-bio.org/show_bug.cgi?id=2597 on this
topic).
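
For what it's worth, a caller-supplied ambiguity table along the lines
suggested above could be sketched with the standard re module. This assumes
the table maps each ambiguity symbol to a regular expression character
class, and that only the query (not the sequence being searched) contains
ambiguity symbols. The function name is made up for the example.

```python
import re


def ambiguous_startswith(sequence, prefix, ambiguity_table):
    """Check whether sequence starts with prefix, expanding ambiguity symbols.

    ambiguity_table maps a symbol to a regex character class, for example
    {"N": "[ACGT]"}. Symbols not in the table are matched literally.
    Hypothetical helper for illustration, not part of Biopython.
    """
    pattern = "".join(
        ambiguity_table.get(symbol, re.escape(symbol)) for symbol in prefix
    )
    # re.match only matches at the start of the string.
    return re.match(pattern, sequence) is not None
```

For example, ambiguous_startswith("ACGT", "AN", {"N": "[ACGT]"}) expands the
query to the pattern A[ACGT] before matching.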

If we consider the situation where someone creates their own custom
alphabet, and wants to define their own ambiguity characters, I think
any ambiguous search functionality would have to interrogate the
alphabet object at run time.  Possible, but a bit tricky.
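
To make the run time lookup concrete, here is a small sketch. The two
alphabet classes and the "ambiguity_table" attribute are hypothetical
stand-ins invented for this example; they are not the Bio.Alphabet classes,
and I am not claiming those expose such an attribute.

```python
class PlainAlphabet:
    """Stand-in for an alphabet that defines no ambiguity symbols."""

    letters = "ACGT"


class CustomDNAAlphabet:
    """Stand-in for a user-defined alphabet with its own ambiguity codes."""

    letters = "ACGTN"
    # Hypothetical attribute: symbol -> the letters it may represent.
    ambiguity_table = {"N": "ACGT"}


def get_ambiguity_table(alphabet):
    """Look up the ambiguity table on an alphabet object at run time.

    Raises TypeError when the alphabet does not define one, so that an
    ambiguous search fails loudly instead of guessing.
    """
    table = getattr(alphabet, "ambiguity_table", None)
    if table is None:
        raise TypeError(
            "alphabet %s does not define ambiguity symbols"
            % type(alphabet).__name__
        )
    return table
```

The tricky part is that every search method would need to do this kind of
check, and decide what to do when the alphabet has nothing to offer.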

>> Several "Zen of Python" points spring to mind, including "If the
>> implementation is hard to explain, it's a bad idea.", but in summary I
>> am against supporting ambiguous characters in the string-like methods
>> of the Seq object (so: find, rfind, split, startswith, endswith, etc).
>> We should handle this another way.
>
> If the natural home for this functionality is Bio.Motif, then the natural
> home for it is Bio.Motif, and I don't have a problem with that.  I'm happy
> to go with the consensus.

Well, let's hear what Bartek has to say (Bio.Motif author).

Peter



