[Biopython] Python equivalent of the Perl String::Approx module for approximate matching?

Kevin Rue kevin.rue at ucdconnect.ie
Thu Mar 13 18:08:18 UTC 2014


Hi Tal,

I finally had the time to look at your function in details, and understood
how it works and why each line is there. Thanks, even though it doesn't do
exactly what I had in mind, it does what the docstring says :)
I'd call your "kevin" function "approxEqual(str1, str2, max_ins, max_sub)".


When I understood your code, I thought of a way to increase the speed of
your function, but ended with a (fast) function actually doing something
slightly different. This one would actually only look at the str2
substrings from str1_idx but which are no longer than
len(str1)+max_insertions. Longer str2 are pointless as they imply that str1
requires more insertions at the start than allowed to match str2.
I would call that function "approxStartsWith(str1, str2, max_ins, max_sub)"

def approxStartsWith(str1, str2, max_substitutions, max_insertions):
    """check if it is possible to map str1 to the start of str2 given
limitations

    The limitations are the maximum allowed number of new characters
inserted
    and the maximum allowed number of character substitutions.
    """
    # check simple cases which are obviously impossible
    if not len(str1) <= len(str2):
        return False

    scores = array('L', [0] * (max_insertions + 1))
    new_scores = scores[:]

    for (str1_idx, char1) in enumerate(str1):
        # make min() always take the other value in the first iteration of
the
        # inner loop
        prev_score = len(str2)
        for (n_insertions, char2) in enumerate(
                str2[str1_idx:max_insertions+str1_idx+1]
        ):
            new_scores[n_insertions] = prev_score =
min(scores[n_insertions] + (0 if char1 == char2 else 1), prev_score )

        # swap scores <-> new_scores
        scores, new_scores = new_scores, scores

    return min(scores) <= max_substitutions


Now, an approxWithin(str1, str2, max_ins, max_sub) could be simply
implemented as:

def approxWithin(str1, str2, max_ins, max_sub):
"""check if it is possible to find str1 within str2 given limitations

The limitations are the maximum allowed number of new characters inserted
and the maximum allowed number of character substitutions.
"""
for str2_idx in range(len(str2)-len(str1)+1):
print (str2_idx)
result = approxStartsWith(str1, str2[str2_idx:], max_ins, max_sub)
if result:
return result
return False

Comments?

Cheers,
kevin







On 12 March 2014 23:56, Kevin Rue <kevin.rue at ucdconnect.ie> wrote:

> HI Tal,
>
> I just tested your function and it is doing something slightly different
> than what I had in mind.
>
> I need a few simple examples to illustrate my point:
>
> The string "TEST" is present in "TESTER" with 0 substitutions/0
> insertions. Therefore I expect the call below to return TRUE.
> >>> kevin(str1="TEST", str2="TESTER", max_substitutions=0,
> max_insertions=0)
> but instead it returns FALSE.
>
> Meanwhile,
> >>> kevin(str1="TEST", str2="TESTER", max_substitutions=0,
> max_insertions=2)
> returns TRUE
>
> and
> >>> kevin("TEST", "TESTER", 0, max_insertions=1)
> returns FALSE
>
> Now, I haven't decrypted your code yet, but my guess is that what your
> function does is answer the question "Is str1 approximately EQUAL to str2
> while allowing a maximum of S substitutions and I insertions".
> My problem (and believe most bioinformaticians like Ivan who answered
> earlier) is formulated "Is str1 present somewhere in str2 while allowing S
> sub and I ins?"
>
>
> In fact, it's your previous answer that made me realise that the function
> solving my problem "str1 in str2 with max of i insertions and s
> substitutions" should not be able to solve the problem "str1 in str2 with
> max of d deletions and s substitutions".
> Meanwhile, you're answer seems right for  the function you sent us "str1
> == str2 with max of i insertions and s substitutions" should be solved by
> the same function called by switching the strings "str2 == str1 with max of
> d deletions and s substitutions". (Just a guess, but it makes sense to me)
>
> If I am right, one solution to solve my problem (with only substitutions
> and insertions) using your function "kevin" is:
> - set str1 as the string I am trying to match
> - call your function for each substring of str2 of length [len(str1) :
> len(str1)+max_insertions+1] and set each of those as str2
> - i can save time by returning TRUE the first time I find a match, because
> I don't care if there are more
> This would compare str1 to all possible str2 substrings that could be
> "approximately EQUAL to str1 allowing up to I insertions and S
> substitutions in str1".
>
> Obviously, another option is to design another function (say.. "kevin2"
> ^_^) which addresses directly my problem. I don't mind using the solution
> above (I appreciate your help and time), but I believe an implementation
> dealing directly with my problem would be faster to solve it, right?
>
> Looking forward to your answer!
> Cheers
> Kevin
>
>
>
>
>
>
>
>
> On 12 March 2014 18:43, Tal Einat <taleinat at gmail.com> wrote:
>
>> On Wed, Mar 12, 2014 at 8:28 PM, Kevin Rue <kevin.rue at ucdconnect.ie>
>> wrote:
>> > HI Tal,
>> >
>> > To answer your question, I currently don't expect any deletion indeed.
>> It is
>> > possible that other users would like that feature, maybe me in the
>> future.
>> > Therefore, this code could be a very convenient temporary fix that I
>> could
>> > include directly in the code as an annex function. Thank you very much.
>> I'll
>> > have a closer look at how it works to teach myself :)
>>
>> If anyone else on this list thinks this would be useful, I'd be happy
>> to publish it as a publicly available library. For now, consider the
>> code I posted freely available in the public domain (i.e. use it at
>> will but don't take credit for it in my stead and don't sell it
>> without my consent).
>>
>> > Now, in a moment of inspiration, am I wrong in saying that similarly, a
>> case
>> > where no insertion is allowed would be the same problem while switching
>> > sequence1 and sequence2 in the function call?
>>
>> Indeed, you are correct :)
>>
>> - Tal Einat
>>
>
>
>
> --
> Kévin RUE-ALBRECHT
> Wellcome Trust Computational Infection Biology PhD Programme
> University College Dublin
> Ireland
> http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en
>



-- 
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en




More information about the Biopython mailing list