[Biopython] comparing sequences.qustion

Nathan Edwards nje5 at georgetown.edu
Wed Feb 8 15:50:15 UTC 2012


Classical method (essentially BYP, obligatory reference to Goldberg):

* for each sequence, divide in two, get s1 and s2.
* place the sequences (or an reference/index) in a dictionary with list 
values at key s1 and s2.

This is linear time.

Any pair of sequences that differ in only one position _must_ have at 
least one of their halves in common, so do detailed alignment on all 
pairs of sequences with a common key. You specified unique, so each pair 
must be considered at most once. If you had duplicates, these would be 
aligned for each of their halves (and you'd have to normalize these out, 
somehow). This will be a small fraction of all pairs, assuming these are 
not pathological sequences.

This works well as long as the halves have enough specificity - for DNA 
length 10 halves should work. Note that this doesn't distinguish between 
left-halves and right-halves, which might have the same key values, but 
obviously won't differ by one. Fixing this is an easy modification. BTW, 
this works even for edit-distance. Only concern is the use of the 
in-memory dictionary data-structure, which can get big.

Untested pseudocode:

from collections import defaultdict
from itertools import combinations

n = 20
halves = defaultdict(list)
for s in sequences:
     s1 = s[:n/2]
     s2 = s[n/2:]
     halves[s1].append(s)
     halves[s2].append(s)

for k in halves.iterkeys():
     for seq1,seq2 in combinations(halves[k],2):
	# check for one-change before expensive alignment?
	align(seq1,seq2)

- n

On 2/7/2012 8:50 PM, Eric Talevich wrote:
> On Tue, Feb 7, 2012 at 8:01 PM, George Devaniranjan
> <devaniranjan at gmail.com>wrote:
>
>> Hi,
>>
>> I have a list of>  200, 000   UNIQUE short EQUAL length sequences.
>> I do the following
>>
>> I am comparing ALL sequences against ALL sequences so there will be (200000
>> * 199999 )/2 comparisons
>> Once a sequence is compared, if they differ from one another by ONE letter
>> only . then I do another more detailed alignment using a BLOSUM matrix.
>>
>> Currently I use the pairwise sequence comparison code found in BIOPYTHON
>> for both comparison, simple comparison where I set
>> match = 0
>> mismatch = -1
>> If the total alignment score is equal to -1 (meaning only one mismatch)
>> then I go a further step and do a BLOSUM alignment.
>>
>> This works but its taking a long long time, I suspect its because I am
>> using TWO alignments but I think there could be a way to do the first
>> simple alignment WITHOUT using the pairwise alignment code for the first
>> part will speed up this calculation.
>> Unfortunately I don't have much more than a desktop to do this, so if
>> someone can suggest a quicker way to do this, I would appreciate it.
>>
>> Thank you,
>> George
>>
>>
> Hi George,
>
> If your sequences are all equal length, and you're interested in the ones
> that differ by 1 character, then the difference between any two of those
> sequences of interest will be a single mismatched character. You don't need
> to do an alignment at all.
>
> Without Python: try clustering at whatever identify threshold corresponds
> to edit distance 1 in your sequences. UCLUST/USEARCH and other programs can
> do this quickly.
>
> With Python: try an expression like:
>
> seq_pairs_of_interest = []
> for i, aseq in input_seq_list[:-1]:
>      for j, bseq in input_seq_list[i+1:]:
>          if sum(a != b for a, b in zip(aseq, bseq)) == 1:
>              seq_pairs_of_interest.append((aseq, bseq))
>
>
> Hope that helps,
> Eric
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


-- 
Dr. Nathan Edwards                      nje5 at georgetown.edu
Department of Biochemistry and Molecular & Cellular Biology
            Georgetown University Medical Center
Room 1215, Harris Building          Room 347, Basic Science
3300 Whitehaven St, NW              3900 Reservoir Road, NW
Washington DC 20007                     Washington DC 20007
Phone: 202-687-7042                     Phone: 202-687-1618
Fax: 202-687-0057                         Fax: 202-687-7186



More information about the Biopython mailing list