[Biopython] Python equivalent of the Perl String::Approx module for approximate matching?

Kevin Rue kevin.rue at ucdconnect.ie
Tue Mar 18 10:34:21 UTC 2014


Hi Tal,

In my particular case, I am searching for a 36-character sequence in a
90-character one. If you're really curious, you can have a look at
RNA-sequencing. It's a technique in biology where we obtain the sequences
of RNA molecules expressed from the genome. But before analysing those
sequences (we usually call them "reads"), we have to identify and filter
out reads that contain an "adapter" sequence used by our machines. The
problem is that the "adapter" sequence can contain mistakes in the read,
hence the fuzzy matching.

I haven't tried your code yet; I'll give it a try when I get a chance, but
I've got a few other tasks to deal with for my own work first. (I've put a
rough, untested sketch of how I expect to call find_near_matches at the
bottom, below your quoted message.)

Cheers,
Kevin


On 15 March 2014 18:59, Tal Einat <taleinat at gmail.com> wrote:

> On Thu, Mar 13, 2014 at 8:08 PM, Kevin Rue <kevin.rue at ucdconnect.ie>
> wrote:
> > Hi Tal,
> >
> > I finally had the time to look at your function in detail, and
> > understood how it works and why each line is there. Thanks, even
> > though it doesn't do exactly what I had in mind, it does what the
> > docstring says :)
> > I'd call your "kevin" function "approxEqual(str1, str2, max_ins,
> > max_sub)".
> >
> > When I understood your code, I thought of a way to increase the speed
> > of your function, but ended up with a (fast) function that actually
> > does something slightly different. This one only looks at the
> > substrings of str2 starting at str1_idx and no longer than
> > len(str1)+max_insertions. Longer stretches of str2 are pointless, as
> > they would imply that str1 requires more insertions at the start than
> > allowed to match str2.
> > I would call that function "approxStartsWith(str1, str2, max_ins,
> > max_sub)".
> >
> > from array import array  # needed for the score arrays below
> >
> > def approxStartsWith(str1, str2, max_substitutions, max_insertions):
> >     """check if it is possible to map str1 to the start of str2 given limitations
> >
> >     The limitations are the maximum allowed number of new characters
> >     inserted and the maximum allowed number of character substitutions.
> >     """
> >     # check simple cases which are obviously impossible
> >     if not len(str1) <= len(str2):
> >         return False
> >
> >     scores = array('L', [0] * (max_insertions + 1))
> >     new_scores = scores[:]
> >
> >     for (str1_idx, char1) in enumerate(str1):
> >         # make min() always take the other value in the first
> >         # iteration of the inner loop
> >         prev_score = len(str2)
> >         for (n_insertions, char2) in enumerate(
> >                 str2[str1_idx:max_insertions + str1_idx + 1]
> >         ):
> >             new_scores[n_insertions] = prev_score = min(
> >                 scores[n_insertions] + (0 if char1 == char2 else 1),
> >                 prev_score,
> >             )
> >
> >         # swap scores <-> new_scores
> >         scores, new_scores = new_scores, scores
> >
> >     return min(scores) <= max_substitutions
> >
> >
> > Now, an approxWithin(str1, str2, max_ins, max_sub) could simply be
> > implemented as:
> >
> > def approxWithin(str1, str2, max_ins, max_sub):
> >     """check if it is possible to find str1 within str2 given limitations
> >
> >     The limitations are the maximum allowed number of new characters
> >     inserted and the maximum allowed number of character substitutions.
> >     """
> >     for str2_idx in range(len(str2) - len(str1) + 1):
> >         print(str2_idx)  # debugging output
> >         # note approxStartsWith's argument order: substitutions first,
> >         # then insertions
> >         result = approxStartsWith(str1, str2[str2_idx:], max_sub, max_ins)
> >         if result:
> >             return result
> >     return False
> >
> > Comments?
>
> Aha! So you are, in fact, searching for a short sequence in a long
> sequence (or many such long sequences). This -is- exactly what
> fuzzysearch is meant for!
>
>
> Regarding the code you posted, it looks like it would work (though I
> haven't tested it). It certainly isn't very efficient, however, since
> it makes many copies of long parts of str2 (see "str2[str2_idx:]"). It
> is also fairly straightforward, leaving some room for further
> optimization. I do like that it is very readable and easy to
> understand! Well, except for the loop in approxStartsWith(), but
> that's based on my own code...
>
>
> Inspired by your use-case, I've added highly generic fuzzy searching
> functionality to fuzzysearch. You can now limit the number of
> substitutions, insertions and deletions as well as their total (i.e.
> limit the Levenshtein distance). You can also limit only some of these
> as you like.
>
> Specifically, this supports your use-case of searching for fuzzy
> matches allowing only a limited number of substitutions and
> insertions, but no deletions.
>
> The user-friendly utility function fuzzysearch.find_near_matches() now
> accepts parameters for limiting the substitutions etc., and chooses a
> suitable implementation based on the given parameters.
>
> I haven't yet implemented an optimized search function allowing only
> substitutions and insertions. If the current version is not fast enough
> for your needs, there are plenty of optimizations still to be done.
>
> I'd be happy if you could give it a whirl and tell me what happens! I
> haven't released it yet, but you can install the latest development
> version using pip (you'll need to have git installed for this):
>
> pip install --upgrade
> git+git://github.com/taleinat/fuzzysearch.git#egg=fuzzysearch
>
> - Tal
>
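
For the record, here is roughly how I expect to call find_near_matches once
I've installed the development version. I haven't run this, and the keyword
names below are my guess from your description (limit substitutions and
insertions, disallow deletions), so please treat it as an untested sketch:

from fuzzysearch import find_near_matches

adapter = "ACGTACGTAC"            # placeholder; the real adapter is 36 characters
read = "TTTTACGTTCGTACTTTTTTTT"   # placeholder; real reads are 90 characters

# guessed keyword arguments: a few substitutions and insertions allowed,
# no deletions
matches = find_near_matches(adapter, read,
                            max_substitutions=2,
                            max_insertions=2,
                            max_deletions=0)

for match in matches:
    print(match.start, match.end, match.dist)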



-- 
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en



