[Biopython] comparing sequences question

Eric Talevich eric.talevich at gmail.com
Wed Feb 8 01:50:28 UTC 2012


On Tue, Feb 7, 2012 at 8:01 PM, George Devaniranjan
<devaniranjan at gmail.com> wrote:

> Hi,
>
> I have a list of more than 200,000 UNIQUE, short, EQUAL-length sequences.
> I do the following:
>
> I compare ALL sequences against ALL sequences, so there will be
> (200000 * 199999) / 2 comparisons.
> Once a pair is compared, if the two sequences differ from one another by
> ONE letter only, I do another, more detailed alignment using a BLOSUM matrix.
>
> Currently I use the pairwise sequence comparison code found in Biopython
> for both comparisons. For the simple comparison I set
> match = 0
> mismatch = -1
> If the total alignment score is equal to -1 (meaning only one mismatch),
> then I go a further step and do a BLOSUM alignment.
>
> This works, but it's taking a very long time. I suspect that's because I am
> using TWO alignments, and I think doing the first, simple comparison
> WITHOUT the pairwise alignment code would speed up this calculation.
> Unfortunately I don't have much more than a desktop to do this, so if
> someone can suggest a quicker way to do this, I would appreciate it.
>
> Thank you,
> George
>
>
Hi George,

If your sequences are all equal length, and you're interested in the ones
that differ by 1 character, then the difference between any two of those
sequences of interest will be a single mismatched character. You don't need
to do an alignment at all.
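
To make that concrete, here is a minimal sketch of a position-by-position
check that stops as soon as a second mismatch is seen. The function name
differs_by_exactly_one is only an illustration, not part of Biopython, and
it assumes both sequences are plain strings of the same length:

```python
def differs_by_exactly_one(aseq, bseq):
    """Return True if two equal-length sequences differ at exactly one position."""
    mismatches = 0
    for a, b in zip(aseq, bseq):
        if a != b:
            mismatches += 1
            if mismatches > 1:
                # A second mismatch rules the pair out; stop scanning early.
                return False
    return mismatches == 1
```

Since most pairs in a large set probably differ in several places, the early
exit should save some work per pair, although the number of pairs stays the
same.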

Without Python: try clustering at whatever identity threshold corresponds
to edit distance 1 in your sequences. UCLUST/USEARCH and other programs can
do this quickly.

With Python: try an expression like:

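seq_pairs_of_interest = []
# enumerate() supplies the index i used to slice off the remaining sequences
for i, aseq in enumerate(input_seq_list[:-1]):
    # Compare aseq only with later sequences so each pair is checked once
    for bseq in input_seq_list[i+1:]:
        # Count mismatched positions; keep pairs with exactly one mismatch
        if sum(a != b for a, b in zip(aseq, bseq)) == 1:
            seq_pairs_of_interest.append((aseq, bseq))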
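
The nested loop above still visits every pair, which for 200,000 sequences
is about 2 x 10^10 comparisons. If that is too slow on a desktop, one
possible alternative (a different technique from the loop above, sketched
here under the assumption that the sequences are unique, equal-length
strings) is to delete one position at a time and group sequences by what
remains. Two unique sequences that land in the same group for a given
position must differ at exactly that position. The function name
pairs_differing_by_one is hypothetical:

```python
from collections import defaultdict
from itertools import combinations


def pairs_differing_by_one(seqs):
    """Return index pairs (i, j), i < j, of sequences differing at one position.

    Assumes seqs is a list of unique strings that all have the same length.
    """
    if not seqs:
        return []
    pairs = []
    for pos in range(len(seqs[0])):
        # Group sequences by their content with position `pos` removed.
        buckets = defaultdict(list)
        for idx, seq in enumerate(seqs):
            buckets[seq[:pos] + seq[pos + 1:]].append(idx)
        # Sequences sharing a bucket agree everywhere except at `pos`.
        for members in buckets.values():
            if len(members) > 1:
                pairs.extend(combinations(members, 2))
    return pairs
```

The work here grows with (number of sequences) x (sequence length) plus the
number of matching pairs, rather than with the square of the number of
sequences. The cost is extra memory for one dictionary per position, so it
is worth trying on a subset first.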


Hope that helps,
Eric


