[Biopython] Python equivalent of the Perl String::Approx module for approximate matching?

Kevin Rue kevin.rue at ucdconnect.ie
Wed Mar 12 23:56:10 UTC 2014


HI Tal,

I just tested your function and it is doing something slightly different
than what I had in mind.

I need a few simple examples to illustrate my point:

The string "TEST" is present in "TESTER" with 0 substitutions/0 insertions.
Therefore I expect the call below to return TRUE.
>>> kevin(str1="TEST", str2="TESTER", max_substitutions=0, max_insertions=0)
but instead it returns FALSE.

Meanwhile,
>>> kevin(str1="TEST", str2="TESTER", max_substitutions=0, max_insertions=2)
returns TRUE

and
>>> kevin("TEST", "TESTER", 0, max_insertions=1)
returns FALSE

Now, I haven't decrypted your code yet, but my guess is that what your
function does is answer the question "Is str1 approximately EQUAL to str2
while allowing a maximum of S substitutions and I insertions".
My problem (and believe most bioinformaticians like Ivan who answered
earlier) is formulated "Is str1 present somewhere in str2 while allowing S
sub and I ins?"


In fact, it's your previous answer that made me realise that the function
solving my problem "str1 in str2 with max of i insertions and s
substitutions" should not be able to solve the problem "str1 in str2 with
max of d deletions and s substitutions".
Meanwhile, you're answer seems right for  the function you sent us "str1 ==
str2 with max of i insertions and s substitutions" should be solved by the
same function called by switching the strings "str2 == str1 with max of d
deletions and s substitutions". (Just a guess, but it makes sense to me)

If I am right, one solution to solve my problem (with only substitutions
and insertions) using your function "kevin" is:
- set str1 as the string I am trying to match
- call your function for each substring of str2 of length [len(str1) :
len(str1)+max_insertions+1] and set each of those as str2
- i can save time by returning TRUE the first time I find a match, because
I don't care if there are more
This would compare str1 to all possible str2 substrings that could be
"approximately EQUAL to str1 allowing up to I insertions and S
substitutions in str1".

Obviously, another option is to design another function (say.. "kevin2"
^_^) which addresses directly my problem. I don't mind using the solution
above (I appreciate your help and time), but I believe an implementation
dealing directly with my problem would be faster to solve it, right?

Looking forward to your answer!
Cheers
Kevin








On 12 March 2014 18:43, Tal Einat <taleinat at gmail.com> wrote:

> On Wed, Mar 12, 2014 at 8:28 PM, Kevin Rue <kevin.rue at ucdconnect.ie>
> wrote:
> > HI Tal,
> >
> > To answer your question, I currently don't expect any deletion indeed.
> It is
> > possible that other users would like that feature, maybe me in the
> future.
> > Therefore, this code could be a very convenient temporary fix that I
> could
> > include directly in the code as an annex function. Thank you very much.
> I'll
> > have a closer look at how it works to teach myself :)
>
> If anyone else on this list thinks this would be useful, I'd be happy
> to publish it as a publicly available library. For now, consider the
> code I posted freely available in the public domain (i.e. use it at
> will but don't take credit for it in my stead and don't sell it
> without my consent).
>
> > Now, in a moment of inspiration, am I wrong in saying that similarly, a
> case
> > where no insertion is allowed would be the same problem while switching
> > sequence1 and sequence2 in the function call?
>
> Indeed, you are correct :)
>
> - Tal Einat
>



-- 
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en




More information about the Biopython mailing list