[BioPython] Adding startswith and endswith methods to the Seq object

Wed Apr 15 23:36:02 UTC 2009

Hi all,

Sorry, but I've missed this thread completely (despite being called by
name a few times).

It's too late for me to address the multiple points raised here, so
I'll try to summarize what I understood:

Peter wants to have a Seq.startswith method and do stuff like:
>if record.seq.startswith(primer) :
>  record = record[crop:]

Leighton would like to have an even more powerful method which would
do things like:
>>> Seq("TAG").startswith("TA[CG]")

Which is quite cool, but Peter raises objections to the semantics of
startswith called with arbitrary strings.

I think that the issue would be resolved if the startswith method
would not accept strings, but Seqs or Motifs.
Assuming that we would have a nice way of generating appropriate
motifs, it would lead to simple code:

m=Motif.from_IUPAC("TAN")

or alternatively

m=Motif.from_re("TA[CG]")

s.startswith(m)

Currently there are no methods from_IUPAC or from_re, but it should be
fairly straightforward to implement them (if there is interest).

Given a motif instance, writing the startswith method is simple.
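
A minimal sketch of the idea in plain Python, not using Bio.Motif itself.
The names IUPAC_DNA, iupac_to_regex and startswith_motif are made up for
illustration (they are not part of Biopython), and the "motif" here is
simply a compiled regular expression built from the standard IUPAC
nucleotide ambiguity codes:

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as "TAN" into a compiled regex.

    Hypothetical helper standing in for something like Motif.from_IUPAC.
    """
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC_DNA[symbol]  # raises KeyError for unknown symbols
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def startswith_motif(sequence, motif):
    """Return True if the sequence begins with a match to the motif."""
    # re.match only matches at the beginning of the string.
    return motif.match(str(sequence)) is not None


m = iupac_to_regex("TAN")
print(startswith_motif("TAGGCT", m))  # True
print(startswith_motif("TTGGCT", m))  # False
```

Calling str() on the sequence means the same function would accept either
a plain string or a Seq-like object.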

There is one caveat: implementing complex regexps with Bio.Motif might
not be as efficient as using regexps directly, but again I could work
on improving the Motif class.
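
For comparison, "using regexps directly" could look roughly like the
sketch below. It relies only on Python's standard re module; the
function names re_startswith and re_endswith are hypothetical, and no
timing claims are made here:

```python
import re


def re_startswith(sequence, pattern):
    """True if the regex pattern matches at the start of the sequence."""
    return re.match(pattern, str(sequence)) is not None


def re_endswith(sequence, pattern):
    """True if the regex pattern matches at the very end of the sequence."""
    # The non-capturing group keeps any alternation in the pattern intact,
    # and \Z anchors the match to the end of the string.
    return re.search("(?:" + pattern + r")\Z", str(sequence)) is not None


print(re_startswith("TAG", "TA[CG]"))   # True
print(re_endswith("GGTAC", "TA[CG]"))   # True
print(re_endswith("TACGG", "TA[CG]"))   # False
```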

hope this helps

cheers
  Bartek

On Mon, Apr 13, 2009 at 7:04 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Apr 13, 2009 at 4:46 PM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
>> However, there's no harm in discussing other options, even if none of
>> us like them...
>>
>> If the sequence has an alphabet that specifies it as either Protein or
>> Nucleotide, then in those cases we can infer clearly what the ambiguity
>> symbol means, and there is no problem.
>
> Strictly speaking, only if the sequence has an (ambiguous) IUPAC
> alphabet can we know what the (ambiguity) symbols mean with certainty.
>  If the sequence has only a generic DNA/RNA/Nucleotide/Protein
> alphabet then we can only make a pretty good guess.
>
>> Alternatively, Seq.startswith() could behave like String.startswith() all
>> the time, unless passed with an optional argument (e.g. "ambiguity=True").
>
> That idea could work.  The default behaviour would be "act like a
> string", but an optional argument to
> startswith/endswith/find/rfind/count/... could enable ambiguity
> matching (provided the sequence has a suitable alphabet).  This would
> be backwards compatible, and allow us to forge ahead with adding
> simple string-like startswith/endswith methods now (which are useful
> as is, and so far everyone seems supportive of), and implement
> ambiguity support later.
>
>> Or maybe another optional argument could be passed to force the search to
>> treat a sequence without an alphabet as either "type='protein'" or
>> "type='RNA'", thereby suppressing the warning/error described above.
>> ...
>> Another alternative could be to have an optional argument defining the
>> ambiguity symbols, and what they represent (e.g.
>> "ambiguity_table={'N':'[ACGT]', 'P':'[QRST]'}").
>
> If we go down the optional argument route (e.g. ambiguity=True), then
> a way of specifying the sequence type or ambiguity characters might be
> possible, although I'd prefer to encourage more rigorous use of
> alphabets in Seq objects in the first place (see also enhancement Bug
> 2597, http://bugzilla.open-bio.org/show_bug.cgi?id=2597 on this
> topic).
>
> If we consider the situation where someone creates their own custom
> alphabet, and wants to define their own ambiguity characters, I think
> any ambiguous search functionality would have to interrogate the
> alphabet object at run time.  Possible, but a bit tricky.
>
>>> Several "Zen of Python" points spring to mind, including "If the
>>> implementation is hard to explain, it's a bad idea.", but in summary I am
>>> against supporting ambiguous characters in the string-like methods of
>>> the Seq object (so: find, rfind, split, startswith, endswith, etc).
>>> We should handle this another way.
>>
>> If the natural home for this functionality is Bio.Motif, then the natural
>> home for it is Bio.Motif, and I don't have a problem with that.  I'm happy
>> to go with the consensus.
>
> Well, let's hear what Bartek has to say (Bio.Motif author).
>
> Peter

-- 
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433