[Biopython] Comparing two already aligned sequences quickly to count gaps/matches/mismatches

Peter Cock p.j.a.cock at googlemail.com
Thu Jan 30 07:30:05 EST 2025


Hello all, and Michiel in particular,

I am wondering if any of the pairwise alignment code in Bio.Align (much of
which is written in C for speed) could help with this use case?:

I have lots of pairs of pre-aligned sequences (imported from an external
MSA file), for which I am doing something like this:

```python
def count_matches_etc(query_seq, subject_seq):
    assert len(query_seq) == len(subject_seq), "Should be same length"
    matches = non_gap_mismatches = either_gapped = both_gapped = 0
    for q, s in zip(query_seq, subject_seq, strict=True):
        if q == "-" and s == "-":
            both_gapped += 1
        elif q == "-" or s == "-":
            either_gapped += 1
        elif q == s:
            matches += 1
        else:
            non_gap_mismatches += 1
    assert matches + non_gap_mismatches + either_gapped + both_gapped ==
len(query_seq)
    return matches, non_gap_mismatches, either_gapped, both_gapped


# Test case
assert (9, 1, 2, 1) == count_matches_etc("ACGTAC-TAC-GT", "AGGT-CGTAC-GT")
```

Sticking with Python that could be optimized (e.g. I am currently using
this with sequences of a million base pairs but few gaps), however I have
written this example with clarity foremost in mind.

Thank you,

Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20250130/e00eeb2f/attachment.htm>


More information about the Biopython mailing list