[Biopython] I've written a library for executing fuzzy searches...
c0d3g33k
c0d3g33k at gmail.com
Sun Nov 17 16:24:33 UTC 2013
On 11/17/2013 04:14 AM, Tal Einat wrote:
> There are already many libraries to compute various
> distance metrics between two strings, but that is not the purpose of
> the library I'm developing (fuzzysearch). My goal is to build a
> library for searching in strings or other sequences (e.g. DNA),
> allowing finding nearly matching parts instead of just full matches.
>
That's what made me think of SimMetrics. It covers your use case and
seems to be well researched, so I thought it might be of interest as you
implement your own library. From the description (bold mine):
> SimMetrics provides a library of float based similarity measures
> between String Data as well as the typical unnormalised metric output.
>
> It is intended for researchers in information integration, II, and
> other related fields. It includes a range of similarity measures from
> a variety of communities, including statistics, *DNA analysis*,
> artificial intelligence, information retrieval, and databases.
>
Here's a list of the metrics that are implemented:
https://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
The other nice thing from a usability perspective was that it offered
the option of normalised output in addition to the raw output of the
original algorithms, which made it easier to compare results when
running a series of metrics on a given set of strings.
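
To make the raw vs. normalised distinction concrete, here is a minimal
Python sketch. It is not SimMetrics code: the function names are made up
for illustration, and dividing by the longer string's length is just one
common way to normalise an edit distance, assumed here rather than taken
from the library.

```python
def levenshtein(a: str, b: str) -> int:
    """Raw (unnormalised) edit distance between a and b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str) -> float:
    """Scale the raw distance into [0, 1]; 1.0 means identical strings.

    Assumption: normalise by the length of the longer input.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

A raw distance of 1 means something different for a 7-character string
than for a 700-character one, whereas the normalised score stays in the
same 0 to 1 range, so it can be compared across metrics and inputs.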
> On Fri, Nov 15, 2013 at 10:12 PM, c0d3g33k <c0d3g33k at gmail.com> wrote:
>
> Hi Tal,
>
> This is only tangentially related to your original post, but I
> thought I'd point out the existence of SimMetrics, a Java-based
> similarity metrics library (GPL v2). I thought that at some point
> there was a Python port, but I could be confusing that with using
> the library myself under Jython. Though it is implemented in
> Java, it might provide a solid foundation for a Python library/API
> should you find it interesting. It's fairly comprehensive, so it
> might at least provide inspiration for extending your current
> efforts. It seems to be unmaintained at present, but source code
> is available both at the original SourceForge page and on GitHub,
> where someone cloned the project.
>
> http://sourceforge.net/projects/simmetrics/
> https://github.com/Simmetrics/simmetrics
>
>
> Hi,
>
> - Tal