[Biopython] Get all alignments of a sequence against another

Kevin Rue kevin.rue at ucdconnect.ie
Fri Mar 14 15:52:12 UTC 2014


Hi Mary, Tal,

In that case you describe, the solution to your problem is rather
straightforward to implement.. I split it in two functions pasted below.

I actually implemented yesterday the function "approx_substitute(str1,
str2, max_substitutions)", see below. This function requires two strings of
the same length, and will tell you TRUE if there are less than N mismatches
between them, comparing the characters at the same position in the two
strings.

Now the function that answers your question is something I just implemented
for you: "list_start_approx_matches_substitutions(str1, str2,
max_mismatches)" see below.
This function will use the previous one to compare your small_string to
each substring of str2 of the same length of str1, and keep the start
position of all positive matches. It will return an "array" object, which
you can easily turn into a regular list using array.tolist() See
http://docs.python.org/2/library/array.html

In short, run the fwo function definitions below. And then use:
>>>list_start_approx_matches_substitutions("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS",
1)

it will return
Out[20]: array('L', [19, 42, 99])

if that's scary, just run:
>>>list_start_approx_matches_substitutions("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS",
1).tolist()
Out[21]: [19, 42, 99]


Kevin



def approx_substitute(str1, str2, max_substitutions):
    """Checks that str1 is less than max_substitutions away from str2.

    Note: No insertions or deletions are allowed. Sequences of different
length will automatically return FALSE.

    Args:
        str1, str2, max_substitutions

    Returns:
        Boolean. TRUE if str1 is less than ax_substitutions away from str2,
FALSE otherwise.
    """
    # Solves a simple scenario which does not require to parse the
sequences.
    if len(str1) != len(str2):
        return False
    # If we reach here, we know that the two strings are the same length,
therefore len(str1) is synonym to len(str2)
    # Initialise a counter of substitutions between the two strings
    substitutions = 0
    # For each (index, character) value pair in str1
    for (index, str1_char) in enumerate(str1):
        # add 1 if the characters from str1 and str2 are different, add 0
otherwise
        substitutions += (0 if str1_char == str2[index] else 1)
        # time saver: if the counter exceeds max_substitutions at some
stage, don't bother checking the rest...
        if substitutions > max_substitutions:
            # ... just return FALSE
            return False
    # If max_substitutions is never reached, the function will eventually
leave the loop above
    # The simple fact of arriving here proves that str1 is less than
max_substitutions away from str2,
    # therefore return TRUE
    return True

from array import array

# define a function that returns the start position of all matches of a
given str1
# in a given larger str2, given a maximum number of mismatches allowed
def list_start_approx_matches_substitutions(str1, str2, max_mismatches):
# Initialise an empty list to save the start positions of the matches
starts = array('L')
# for each substring of str2 which is the same length as str1
for i in range(len(str2)-len(str1)+1):
# if there are less than N mismatches between str1 and substr2
if approx_substitute(str1, str2[i:i+len(str1)], max_mismatches):
# save the start position of the match (the end position can be guessed
# from the length of str1, the only information lost is the number of
mismatches
# between str1 and substr2)
starts.append(i)
return starts



On 14 March 2014 15:14, Mary Kindall <mary.kindall at gmail.com> wrote:

> Hi Tal and Kevin,
> Thanks for mail and the user friendly package. I was not aware of the
> existence of "fuzzysearch" package.
>
> Kevin, I am allowing the mismatches (to a defined maximum which is a
> function of the length of the pattern) but there is a strict 'no' for
> insertions and deletions.
>
> I do see that the functions 'fuzzysearch.find_near_matches' and
> 'fuzzysearch.find_near_matches_with_ngrams' works perfect for mismatches.
> However, I could not find a way to avoid alignment when there is an
> insertion or deletion.
>
> Is there a way to restrict the maximum distance to mismatches only?
> Levenshtein distance seems to have the same issue.
>
> Thanks and regards
> Mary
>
>
>
>
>
> On Fri, Mar 14, 2014 at 7:11 AM, Tal Einat <taleinat at gmail.com> wrote:
>
>> On Fri, Mar 14, 2014 at 11:16 AM, Kevin Rue <kevin.rue at ucdconnect.ie>
>> wrote:
>> > >>> import fuzzysearch
>> > >>>
>> fuzzysearch.find_near_matches_with_ngram("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS",
>> >>>> 1)
>>
>> By the way, you should usually just use
>> fuzzysearch.find_near_matches(...), which will choose an appropriate
>> search method for you depending on the given parameters.
>>
>> - Tal Einat
>>
>
>
>
> --
> -------------
> Mary Kindall
> Yorktown Heights, NY
> USA
>



-- 
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en




More information about the Biopython mailing list