[Biopython] Python equivalent of the Perl String::Approx module for approximate matching?
Kevin Rue
kevin.rue at ucdconnect.ie
Wed Mar 12 15:16:03 UTC 2014
Hi,
@Ivan: Glad to hear you confirm my thought!
@Saket: You're right. For the past two days I have been in touch with
"taleinat", the person who developed that code :) You will see on his
GitHub that, in agreement with him, I suggested my feature as a possible
enhancement of his package (issue #2,
https://github.com/taleinat/fuzzysearch/issues), and he agreed to consider
it for future development. No promised release date, but:
1) I wouldn't dare ask for one, as I am already asking a huge favor of
someone else to program that "for me" and the community.
2) I am not in a particular rush; his Levenshtein distance does an
acceptable job for the time being. I would love to be able to write the
code myself, but my PhD thesis is more about using scripts to gain biological
knowledge, and my issue would be better handled by someone with much
stronger low-level programming skills, who can use abstract mathematical
notions to optimise the code beyond anything I could do with my scripting
skills.
Cheers
Kevin
PhD candidate :)
On 12 March 2014 13:46, Saket Choudhary <saketkc at gmail.com> wrote:
> Hi Kevin,
>
> There is a package which does something similar.
>
> https://github.com/taleinat/fuzzysearch
>
>
> Saket
>
> On 12 March 2014 11:32, Kevin Rue <kevin.rue at ucdconnect.ie> wrote:
> > Hi all,
> >
> > Some may consider this a repeat of my StackOverflow post
> > (http://stackoverflow.com/questions/22328884/python-equivalent-of-the-perl-stringapprox-amatch-function)
> > but over there I didn't mention the possibility of implementing the
> > feature in Biopython.
> >
> > I am looking for a function which, given sequence1 and sequence2, would
> > return whether sequence1 matches a subsequence of sequence2 allowing up
> > to I insertions, D deletions, and S substitutions.
> >
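To make the request above concrete, here is a minimal pure-Python sketch of
such a function. It is not how String::Approx or fuzzysearch work internally;
the name approx_match and its parameters are hypothetical, and I am assuming
that edits are counted relative to sequence1 (an insertion is an extra
character in sequence2, a deletion is a character of sequence1 missing from
sequence2). It is a memoised recursion, so it is only meant for short
sequences and is certainly not optimised.

```python
from functools import lru_cache


def approx_match(pattern, text, max_ins=0, max_del=0, max_sub=0):
    """Return True if `pattern` matches some substring of `text` using at
    most max_ins insertions, max_del deletions and max_sub substitutions.

    Hypothetical helper: edits are counted relative to `pattern`.
    """

    @lru_cache(maxsize=None)
    def match(i, j, ins, dels, subs):
        # i: position in pattern, j: position in text,
        # ins/dels/subs: remaining budget for each kind of edit.
        if i == len(pattern):
            return True  # whole pattern consumed: a match ends here
        if j < len(text):
            if pattern[i] == text[j]:
                # exact character match, no budget spent
                if match(i + 1, j + 1, ins, dels, subs):
                    return True
            elif subs > 0 and match(i + 1, j + 1, ins, dels, subs - 1):
                return True  # substitution
            if ins > 0 and match(i, j + 1, ins - 1, dels, subs):
                return True  # insertion: skip an extra character in text
        # deletion: skip a pattern character that is absent from text
        return dels > 0 and match(i + 1, j, ins, dels - 1, subs)

    # Try every possible start position of the match in text.
    return any(match(0, start, max_ins, max_del, max_sub)
               for start in range(len(text) + 1))


print(approx_match("ACGT", "TTAGGTTT"))             # False
print(approx_match("ACGT", "TTAGGTTT", max_sub=1))  # True
print(approx_match("ACGT", "TTAGTTT", max_del=1))   # True
```

The number of cached states grows roughly with
len(pattern) * len(text) * (I+1) * (D+1) * (S+1), which is why a compiled
implementation would be preferable for real sequence data.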
> > So far, all I could find in Python were fuzzy matching functions using
> > edit distances (Levenshtein and others), but none of those distances
> > distinguish between insertions, deletions and substitutions
> > (http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison).
> >
> > There is a Perl module called String::Approx
> > (http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm), where the
> > function amatch() does exactly what I want... except in Perl. A
> > quick-and-dirty fix could be to make an external call to that Perl
> > function from my Python script, but it would be so much cleaner (and
> > probably faster) if I could avoid external calls and being dependent on
> > multiple interpreters.
> >
> > I believe that the feature I described could rapidly become popular if
> > implemented in Biopython, but after reading the Perl module code and not
> > understanding most of it, I think any Python module I could write to do
> > the job wouldn't be nearly as optimised and fast (an external call to the
> > Perl module would surely be faster than my Python implementation).
> >
> > So....
> > - What are your thoughts?
> > - Did I miss the magic Python package that does what I want?
> > - Does anyone else think such a package would be useful to the
> > bioinformatics community?
> > - Did anyone solve the same issue I'm having in a different way? (I
> > haven't found a "think outside the box" idea yet)
> > - Does anyone feel like implementing this feature?
> >
> > Many thanks for your advice!
> >
> >
> > --
> > Kévin RUE-ALBRECHT
> > Wellcome Trust Computational Infection Biology PhD Programme
> > University College Dublin
> > Ireland
> > http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en
> >
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
>
--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en