[Biopython] Get all alignments of a sequence against another

Fri Mar 14 18:01:51 UTC 2014

On Fri, Mar 14, 2014 at 5:52 PM, Kevin Rue <kevin.rue at ucdconnect.ie> wrote:
> Hi Mary, Tal,
>
> In that case you describe, the solution to your problem is rather
> straightforward to implement.. I split it in two functions pasted below.

Kevin, that's a nice solution!

Here's a somewhat more efficient solution, based on the same basic
principals but implemented with some optimizations and non-trivial
index juggling. This will be included in future versions of
fuzzysearch. Raw code is below; for the next month or so a highlighted
version can be found at: dpaste.com/1728155/

I still feel we're reinventing the wheel here. Surely it is possible
to do this with BioPython. Unfortunately, I too couldn't easily figure
out how to do so from reading the documentation and a bit of trial and
error.

- Tal Einat

from collections import deque, defaultdict, namedtuple
from itertools import islice

Match = namedtuple('Match', ['start', 'end', 'dist'])

def find_near_matches_only_substitutions(subsequence, sequence,
                                         max_substitutions):
    """search for near-matches of subsequence in sequence

    This searches for near-matches, where the nearly-matching parts of the
    sequence must meet the following limitations (relative to the subsequence):

    * the number of character substitutions must be less than max_substitutions
    * no deletions or insertions are allowed
    """
    if not subsequence:
        raise ValueError('Given subsequence is empty!')

    # simple optimization: prepare some often used things in advance
    _SUBSEQ_LEN = len(subsequence)
    _SUBSEQ_LEN_MINUS_ONE = _SUBSEQ_LEN - 1

    # prepare quick lookup of where a character appears in the subsequence
    char_indexes_in_subsequence = defaultdict(list)
    for (index, char) in enumerate(subsequence):
        char_indexes_in_subsequence[char].append(index)

    # we'll iterate over the sequence once, but the iteration is split into two
    # for loops; therefore we prepare an iterator in advance which will be used
    # in for of the loops
    sequence_enum_iter = enumerate(sequence)

    # We'll count the number of matching characters assuming various attempted
    # alignments of the subsequence to the sequence. At any point in the
    # sequence there will be N such alignments to update. We'll keep
    # these in a "circular array" (a.k.a. a ring) which we'll rotate after each
    # iteration to re-align the indexing.

    # Initialize the candidate counts by iterating over the first N-1 items in
    # the sequence. No possible matches in this step!
    candidates = deque([0], maxlen=_SUBSEQ_LEN)
    for (index, char) in islice(sequence_enum_iter, _SUBSEQ_LEN_MINUS_ONE):
        for subseq_index in [idx for idx in
char_indexes_in_subsequence[char] if idx <= index]:
            candidates[subseq_index] += 1
        candidates.appendleft(0)

    matches = []

    # From the N-th item onwards, we'll update the candidate counts exactly as
    # above, and additionally check if the part of the sequence whic began N-1
    # items before the current index was a near enough match to the given
    # sub-sequence.
    for (index, char) in sequence_enum_iter:
        for subseq_index in char_indexes_in_subsequence[char]:
            candidates[subseq_index] += 1

        # rotate the ring of candidate counts
        candidates.rotate(1)
        # fetch the count for the candidate which started N-1 items ago
        n_substitutions = _SUBSEQ_LEN - candidates[0]
        # set the count for the next index to zero
        candidates[0] = 0

        # if the candidate had few enough mismatches, yield a match
        if n_substitutions <= max_substitutions:
            matches.append(Match(
                start=index - _SUBSEQ_LEN_MINUS_ONE,
                end=index + 1,
                dist=n_substitutions,
            ))

    return matches