[Biopython] Python equivalent of the Perl String::Approx module for approximate matching?

Wed Mar 12 18:00:16 UTC 2014

> Kevin wrote:
>
> I am looking for a function which, given sequence1 and sequence2, would
> return whether sequence1 matches a subsequence of sequence2 allowing up to
> I insertions, D deletions, and S substitutions.

Kevin, if you don't want to allow deletions at all (I got the
impression this is what you're looking for) then the following code
will do the trick (and should be quite fast).

Is this generally useful, or would it usually be more useful to also
allow a (possibly limited) number of deletions?

from array import array

def kevin(str1, str2, max_substitutions, max_insertions):
    """check if it is possible to transform str1 into str2 given limitations

    The limiations are the maximum allowed number of new characters inserted
    and the maximum allowed number of character substitutions.
    """
    # check simple cases which are obviously impossible
    if not len(str1) <= len(str2) <= len(str1) + max_insertions:
        return False

    scores = array('L', [0] * (len(str2) - len(str1) + 1))
    new_scores = scores[:]

    for (str1_idx, char1) in enumerate(str1):
        # make min() always take the other value in the first iteration of the
        # inner loop
        prev_score = len(str2)
        for (n_insertions, char2) in enumerate(
                str2[str1_idx:len(str2)-len(str1)+str1_idx+1]
        ):
            new_scores[n_insertions] = prev_score = min(
                scores[n_insertions] + (0 if char1 == char2 else 1),
                prev_score
            )

        # swap scores <-> new_scores
        scores, new_scores = new_scores, scores

    return min(scores) <= max_substitutions

- Tal