[Biopython] Get all alignments of a sequence against another

Kevin Rue kevin.rue at ucdconnect.ie
Fri Mar 14 09:29:26 UTC 2014


Sorry for the multiple emails.

My mistake: the duplicated last match does not replace the second-to-last
match. Instead, the first match is simply missing from the output list
(even though the right NUMBER of matches is returned).
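
In the meantime, the substitution-only case can be cross-checked without any
external package. The sketch below is only an illustration, not part of
fuzzysearch or Biopython: the function name find_with_mismatches is made up,
the 'large' string is rebuilt to resemble the one in this thread (the run
lengths are an assumption), and indels are deliberately ignored.

```python
def find_with_mismatches(small, large, max_mismatches=1):
    """Slide `small` along `large` and keep every window that differs
    in at most `max_mismatches` positions (substitutions only, no indels).

    Returns a list of (start, end, mismatches) tuples, end exclusive.
    """
    hits = []
    n = len(small)
    for start in range(len(large) - n + 1):
        window = large[start:start + n]
        # Count positions where the two equal-length strings disagree.
        mismatches = sum(a != b for a, b in zip(small, window))
        if mismatches <= max_mismatches:
            hits.append((start, start + n, mismatches))
    return hits


# Hypothetical test string modelled on the thread's example:
# one site with a single substitution (V for L) and one exact site.
large = "X" * 19 + "GGGTTVTTSS" + "A" * 35 + "B" * 25 + "GGGTTLTTSS"
small = "GGGTTLTTSS"

hits = find_with_mismatches(small, large, max_mismatches=1)
for start, end, mismatches in hits:
    print(start, end, mismatches)
```

With this input I would expect one hit per site, so a missing first match in
another tool's output should be easy to spot by comparison.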


On 14 March 2014 09:16, Kevin Rue <kevin.rue at ucdconnect.ie> wrote:

> Hi Mary,
>
> There is one blurry area in your question: how exactly do you define "a
> location where your small_sequence aligns"?
> From your example, it seems you're not looking for exact matches, but you
> allow in this case 1 mismatch. Is it a maximal number of mismatches? Do you
> also want to allow indels? Do you want to control the number of insertions,
> deletions, substitutions separately? Is a match a local alignment above a
> score threshold?
>
> I would suggest that you have a look at the definition of the Levenshtein
> distance (see the example:
> http://en.wikipedia.org/wiki/Levenshtein_distance#Example).
> If this metric suits you, for instance to find all the matches of the
> small_sequence in the large_sequence with a maximal edit distance of 1,
> you can use one of the Python packages implementing the Levenshtein
> distance, like "fuzzysearch"
> (https://pypi.python.org/pypi/fuzzysearch/0.2.0), this way:
>
> >>> import fuzzysearch
> >>>
> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTLTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS",
> 1)
>
> The output contains two matches:
> Out[7]: [Match(start=89, end=99, dist=0), Match(start=89, end=99, dist=0)]
>
> BUG:
> I did notice that the second match is reported twice, and I assume this
> is a bug where the first match was somehow replaced by the second. This
> is why I copied Tal (the developer of this package) on this email.
>
> Another example, where I added your sequence (with a mismatch) a third time:
>
> >>>
> fuzzysearch.find_near_matches_with_ngrams("GGGTTLTTSS","XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS",
> 1)
>
> returns
> Out[9]:
> [Match(start=42, end=52, dist=1),
>  Match(start=99, end=109, dist=0),
>  Match(start=99, end=109, dist=0)]
>
> You can see three matches: one of the mismatched sequences was detected
> correctly (edit distance of 1), but the bug seems to duplicate the last
> match and replace the second-to-last match with it.
>
> Tal, can you fix that? I will add the issue to your repository :)
>
> Cheers
> Kevin
>
>
>
>
> On 13 March 2014 19:57, Mary Kindall <mary.kindall at gmail.com> wrote:
>
>> This is a primitive question but somehow I could not find a solution to
>> it.
>> I have two sequences 'large' and 'small' as given below.
>>
>> >large
>>
>> XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS
>>
>>
>> >small
>> GGGTTVTTSS
>>
>>
>> I need to align the 'small' sequence to the 'large' sequence. Clearly there
>> are two places where it can be aligned. I need to get the indices of both
>> locations. I was trying BioPython's "pairwise2.align.globalms" function but
>> it is only able to align to the second position.
>>
>>
>>
>> pairwise2.align.globalms(largeStr, smallStr, 2, -1, -1, 0,
>> penalize_end_gaps=False)
>> Ans:
>>
>> [('XXXXXXXXXXXXXXXXXXXGGGTTVTTSSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBGGGTTLTTSS',
>>
>> '-----------------------------------------------------------------------------------------GGGTTLTTSS',
>> 20.0,
>> 0,
>> 99)]
>>
>>
>>
>> Which parameter can I change here, or which other package/lightweight free
>> software can compute this?
>>
>> --
>> Mary
>>
>
>
>
> --
> Kévin RUE-ALBRECHT
> Wellcome Trust Computational Infection Biology PhD Programme
> University College Dublin
> Ireland
> http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en
>
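
If indels should count as well, the Levenshtein-style search discussed in the
quoted message can be sketched with a small dynamic-programming loop. This is
a rough illustration under my own assumptions, not how fuzzysearch works
internally: the function name approx_match_ends is made up, the test string
only resembles the one in the thread, and only the END position of each
candidate site is reported.

```python
def approx_match_ends(pattern, text, max_dist):
    """Report (end, dist) for every position in `text` where some substring
    ending there (end exclusive) is within `max_dist` edits of `pattern`.

    Edits are substitutions, insertions and deletions (Levenshtein distance).
    """
    m = len(pattern)
    # prev[i] = cost of matching pattern[:i] against an empty piece of text.
    prev = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        # A match may start anywhere in the text, so the empty pattern
        # prefix always costs 0.
        curr = [0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitution
                            prev[i] + 1,         # extra character in text
                            curr[i - 1] + 1))    # pattern character skipped
        if curr[m] <= max_dist:
            hits.append((j, curr[m]))
        prev = curr
    return hits


# Hypothetical test string modelled on the thread's example.
text = "X" * 19 + "GGGTTVTTSS" + "A" * 35 + "B" * 25 + "GGGTTLTTSS"
ends = approx_match_ends("GGGTTLTTSS", text, max_dist=1)
for end, dist in ends:
    print(end, dist)
```

Note that neighbouring end positions can be reported for the same site (an
exact hit also matches with its last character dropped, at distance 1), so
the raw list usually needs merging before counting distinct locations.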






