[Biopython] I've written a library for executing fuzzy searches...

Mon Nov 18 17:44:02 UTC 2013

Hi Tal,
  meanwhile landed in my Inbox other emails in this thread. I really think you should
update the README file in your project and emphasize the goals and, notably, provide
some comparison to other, existing tools. Personally I would like to read that first
before contributing yet another tool. I somewhat expected that you rather tell me what
is good or bad with pyre2 and that you could quickly spot what is better in your approach
compared to something else. The simmetrics project mentioned by c0d3g33k at gmail.com
is only making me wonder why did you startup fuzzysearch at all. However, I am a biologist
by heart, or at least, more a biologist then an informatician/programmer.

  I recognize several important properties I would like to use, potentially:

1. Support multiple matches in the target string (want to get coordinates and the matched
   string).
2. To gain speed, sometimes I want to direct whatever tool to e.g. give me just the very
   leftmost or the very rightmost matching region.
3. Ability to force more compact alignments (to overcome cases when a wider but weaker alignment
   scores better than a shorter one).
4. User could specify max number of serious differences as counts or percentages of the query
   length or target sequence length or alignment length. Similarly, number of weak differences
   (read further below).
5. I work with 454-based data. Maybe your tool could help with rough searches through them.
   Some examples below, the gap opening/extension penalties are a wild guess from top of my head,
   I suspect several additional penalties will be needed to get thing working. Here are some
   sequences (weak):

1    gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
2    gactaactggtgtataagcgatgactatatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
3    gactaactggtgtataagcgatgactatatgAAacaaaaaaaaaaaaaaaaaaaaaaaaa
4    gactaactggtgtataagcgatgactatatAgAacaaaaaaaaaaaaaaaaaaaaaaaaa
5    gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaaaaaaaaaaaaaaa
6    gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaGaaaaCaaaaaaaaaa
7    gactaactggGtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
8    gactaactg tgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
9    gactaactggtgtataagcgatgactatAatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
10   GgactaactggtgtataagcgatgactatatgacaaaaaaaaaGATCGANGTACTGA
11   Ggactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaa
12   gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaNNNNNNNNNNNNNN

   The modifications are in uppercased letters. The 454 but also IonTorrent suffers from so called
   CAFIE and OVERCALL and UNDERCALL errors, which I showed in the examples above. A simple, algorithmically
   static (just summing up differences) distance metrics is not helpful here, we need something more clever
   so that all the examples above are recognized as matching. For example, I would penalize A in -3 or -2
   position from the aaaaaaaaaaaaaaaaaaaaaaaaa only minimally or not at all (rows 2 and 3). Likewise, A in
   -5 position (4th row). Likewise, the CAFIE errors occur in plus positions +2, +3 (not shown).

   In contrary, a significant penalty should be assigned to these cases (serious differences):
13    gactaactggCtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
14    gactaGactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
15    gactaactggtgtataagcgatgactatatgacaTaaaaaaaaaaaaaaaaaaaaaaaa
16    gactaactggtgtataagcgatgactatatgaGcaaaaaaaaaaaaaaaaaaaaaaaaa

   I do not know what Bastien C. has invented for mira assembler but it has some builtin
   editor so maybe you could ask him for details so that you do not re-invent the wheel.
   It must be using some internal scoring algorithm to do something like what I am asking
   here.

Martin

Tal Einat wrote:
> Hi Martin!
> 
> I'm really excited to get such a response! I would love feedback and suggestions on how this could be made more useful for Biological uses. If you could expand on specific biological use-cases and their details, for example, that would be lovely!
> 
> - Tal
> 
> 
> 
> On Fri, Nov 15, 2013 at 1:38 PM, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz <mailto:mmokrejs at fold.natur.cuni.cz>> wrote:
> 
>     Hello Tal,
>       it is interesting. I needed something like this a while ago and the alternatives
>     were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I had problems
>     with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at the moment.
>       I would prefer you keep fuzzysearch as a separate package and biopython just import
>     it, as an optional dependency. There is lot more people looking for fuzzy search tools
>     under python and no reason to hide it under biopython. Search for Longest Common Sequence
>     (LCS) on the internet.
>       Finally, I lack any comparison to existing tools in the README. ;-) Would you mind
>     looking into that?
> 
>       I should be able to give some more feedback later on if you want, in respect to biology.
>     I would ask for something looser in searches to overcome under-called and over-called
>     nucleotides in 454 sequences. The Levenshtein is not the best measure for these data
>     and we need something respecting more the reality.
>     Martin
> 
>     Tal Einat wrote:
>     > Hi everyone,
>     >
>     > (I'm not on this list, so please make sure to reply to me as well as the
>     > list.)
>     >
>     > In response to a stackoverflow
>     > question<http://stackoverflow.com/questions/19725127/>,
>     > I've written a Python library for fuzzy searches called
>     > 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
>     > Currently, it allows searching for a string inside a longer string,
>     > returning the best sub-string which match up to a given maximum Levenshtein
>     > distance. This is done quite efficiently, and there is more optimization to
>     > be done, as needed.
>     >
>     > Is there any interest in this library and its further development? One
>     > thing which I think might be useful is support for BioPython Sequence types.
>     >
>     > This is open-source with a very liberal license (the MIT license).
>     >
>     > I'd be happy to collaborate on this!
>     >
>     > - Tal Einat
>     > _______________________________________________
>     > Biopython mailing list  -  Biopython at lists.open-bio.org <mailto:Biopython at lists.open-bio.org>
>     > http://lists.open-bio.org/mailman/listinfo/biopython
>     >
>     >
> 
>