[Biopython] Python equivalent of the Perl String::Approx module for approximate matching?

Wed Mar 12 18:28:53 UTC 2014

HI Tal,

To answer your question, I currently don't expect any deletion indeed. It
is possible that other users would like that feature, maybe me in the
future.
Therefore, this code could be a very convenient temporary fix that I could
include directly in the code as an annex function. Thank you very much.
I'll have a closer look at how it works to teach myself :)

You actually (positively) surprised me by taking this approach: I didn't
even think of taking the advantage of 0 deletions in my particular use case
! I can imagine that it does reduce the possible combination of edits. I
guess I was too impressed by the Perl function that seemed to be doing all
at once, but indeed code can be much simpler and faster when one takes
advantage of the known pre-conditions and specific use cases.
I can imagine that a master function could handle Sub, Del and Ins
together, but if Del=0 then it could call this one if it solves the
specific problem faster.

Now, in a moment of inspiration, am I wrong in saying that similarly, a
case where no insertion is allowed would be the same problem while
switching sequence1 and sequence2 in the function call?

PS: your updated fuzzysearch 0.2.0 package works great for its job. I just
haven't had time to check the documentation updates.

Cheers!

On 12 March 2014 18:00, Tal Einat <taleinat at gmail.com> wrote:

> > Kevin wrote:
> >
> > I am looking for a function which, given sequence1 and sequence2, would
> > return whether sequence1 matches a subsequence of sequence2 allowing up
> to
> > I insertions, D deletions, and S substitutions.
>
> Kevin, if you don't want to allow deletions at all (I got the
> impression this is what you're looking for) then the following code
> will do the trick (and should be quite fast).
>
> Is this generally useful, or would it usually be more useful to also
> allow a (possibly limited) number of deletions?
>
>
> from array import array
>
> def kevin(str1, str2, max_substitutions, max_insertions):
>     """check if it is possible to transform str1 into str2 given
> limitations
>
>     The limiations are the maximum allowed number of new characters
> inserted
>     and the maximum allowed number of character substitutions.
>     """
>     # check simple cases which are obviously impossible
>     if not len(str1) <= len(str2) <= len(str1) + max_insertions:
>         return False
>
>     scores = array('L', [0] * (len(str2) - len(str1) + 1))
>     new_scores = scores[:]
>
>     for (str1_idx, char1) in enumerate(str1):
>         # make min() always take the other value in the first iteration of
> the
>         # inner loop
>         prev_score = len(str2)
>         for (n_insertions, char2) in enumerate(
>                 str2[str1_idx:len(str2)-len(str1)+str1_idx+1]
>         ):
>             new_scores[n_insertions] = prev_score = min(
>                 scores[n_insertions] + (0 if char1 == char2 else 1),
>                 prev_score
>             )
>
>         # swap scores <-> new_scores
>         scores, new_scores = new_scores, scores
>
>     return min(scores) <= max_substitutions
>
>
> - Tal
>

-- 
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en