[BioPython] Adding startswith and endswith methods to the Seq object
Peter
peter at maubp.freeserve.co.uk
Mon Apr 13 15:56:35 UTC 2009
Leighton Pritchard wrote:
>> I'd quite like to (eventually) have the capability either to provide ambiguity
>> symbols, or to query with a regular expression along the lines of
>> re.match() (or maybe the nonexistent re.endmatch()).
>>
>> Since this isn't implemented yet, maybe there's still time to consider this
>> potential usage in the implementation?
Peter wrote:
> [Stuff about issues with the alphabet altering the behaviour] ..., but in
> summary I am against supporting ambiguous characters in the string-like
> methods of the Seq object (so: find, rfind, split, startswith, endswith, etc).
> We should handle this another way.
>
> Bartek: would Bio.Motif give us a nice way to do these kinds of
> searches? For example, given a simple nucleotide motif of "TAN"
> (which should match TAA, TAC, TAG or TAT) or "TAS" (which should match
> "TAC" or "TAG"), can we check if this matches at the start of a target
> nucleotide sequence? And similarly for protein motifs (e.g. signal
> peptides).
This feels like a rehash of some of the debate on Bug 2601, doesn't it?
http://bugzilla.open-bio.org/show_bug.cgi?id=2601
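
To make the "TAN" / "TAS" question above concrete, here is a rough sketch
of the kind of check I have in mind, using only the standard re module.
It is not Bio.Motif and not a proposed Seq method; the helper names
(motif_to_regex, motif_startswith, motif_endswith) are made up for
illustration, and it assumes an unambiguous upper-case DNA target.

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def motif_to_regex(motif):
    """Turn an ambiguous motif such as "TAN" into a regex string.

    Hypothetical helper: unambiguous letters are kept as-is, ambiguous
    letters become a character class.
    """
    parts = []
    for letter in motif.upper():
        bases = IUPAC_DNA[letter]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def motif_startswith(sequence, motif):
    """True if the motif matches at the very start of the sequence."""
    return re.match(motif_to_regex(motif), str(sequence).upper()) is not None


def motif_endswith(sequence, motif):
    """True if the motif matches at the very end of the sequence.

    There is no re.endmatch(), so anchor the pattern with \\Z instead.
    """
    pattern = motif_to_regex(motif) + r"\Z"
    return re.search(pattern, str(sequence).upper()) is not None
```

With this, motif_startswith("TAGCCC", "TAS") is true and
motif_startswith("TAACCC", "TAS") is false. Note that ambiguity is only
expanded in the motif: an "N" in the target sequence will not match
"TAN" here, which is exactly the sort of alphabet question I would
rather not bury inside startswith and endswith.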
On Bug 2601 comment 5, Leighton wrote:
>> I think the abstract distinction between search types here is:
>>
>> 1) Find match at start of sequence (re.match() and string.startswith())
>> 2) Find first match in sequence (re.search() and string.find())
>> 3) Find all non-overlapping matches in sequence (re.finditer() only)
>> 4) Find all overlapping matches in sequence (neither re nor string)
>> 1a) 2a) 3a) 4a) The same, but in the reverse complement.
>>
>> Moving down the list, the problem becomes more general. The type
>> of search I need most often in biological sequences is number (4a),
>> or (4) for proteins. Each of search types (1) to (3) (a or not) has a
>> theoretically faster implementation than doing (4) then filtering the
>> results. I don't mind having more than one search method with
>> different names, or having to specify arguments to get a particular
>> kind of search. I do mind not having (4a) as an option...
Bartek, can Bio.Motif address these four (or eight) questions from
Leighton, or am I expecting the wrong things from it?
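
For reference, here is how I read Leighton's four search types, sketched
with plain strings and the standard re module. Again this is only an
illustration, not Bio.Motif and not a proposed API: the function names
are invented, the sequence is assumed to be unambiguous upper-case DNA,
and the "a" variants simply search the reverse complement and report
positions on that strand.

```python
import re

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def match_at_start(pattern, seq):
    """(1) Is there a match at the start of the sequence? (re.match)"""
    return re.match(pattern, seq) is not None


def find_first(pattern, seq):
    """(2) Start of the first match, or -1 if none. (re.search)"""
    m = re.search(pattern, seq)
    return m.start() if m else -1


def find_non_overlapping(pattern, seq):
    """(3) Starts of all non-overlapping matches. (re.finditer)"""
    return [m.start() for m in re.finditer(pattern, seq)]


def find_overlapping(pattern, seq):
    """(4) Starts of all matches, overlapping ones included.

    The pattern is wrapped in a zero-width lookahead, so the scan
    advances one position at a time instead of past each match.
    """
    lookahead = "(?=(?:%s))" % pattern
    return [m.start() for m in re.finditer(lookahead, seq)]


def find_overlapping_revcomp(pattern, seq):
    """(4a) As (4), but on the reverse complement strand.

    Positions are relative to the reverse complement string.
    """
    return find_overlapping(pattern, reverse_complement(seq))
```

For example, searching "AAAA" for "AA" gives [0, 2] with
find_non_overlapping but [0, 1, 2] with find_overlapping, which is the
distinction between (3) and (4).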
Peter