[Biopython] comparing sequences.qustion

Wed Feb 8 16:08:30 UTC 2012

Argh, Gusfield "Algorithms on Strings, Trees, and Sequences" is the 
obligatory string matching reference...

- n

On 2/8/2012 10:50 AM, Nathan Edwards wrote:
>
> Classical method (essentially BYP, obligatory reference to Goldberg):
>
> * for each sequence, divide in two, get s1 and s2.
> * place the sequences (or an reference/index) in a dictionary with list
> values at key s1 and s2.
>
> This is linear time.
>
> Any pair of sequences that differ in only one position _must_ have at
> least one of their halves in common, so do detailed alignment on all
> pairs of sequences with a common key. You specified unique, so each pair
> must be considered at most once. If you had duplicates, these would be
> aligned for each of their halves (and you'd have to normalize these out,
> somehow). This will be a small fraction of all pairs, assuming these are
> not pathological sequences.
>
> This works well as long as the halves have enough specificity - for DNA
> length 10 halves should work. Note that this doesn't distinguish between
> left-halves and right-halves, which might have the same key values, but
> obviously won't differ by one. Fixing this is an easy modification. BTW,
> this works even for edit-distance. Only concern is the use of the
> in-memory dictionary data-structure, which can get big.
>
> Untested pseudocode:
>
> from collections import defaultdict
> from itertools import combinations
>
> n = 20
> halves = defaultdict(list)
> for s in sequences:
> s1 = s[:n/2]
> s2 = s[n/2:]
> halves[s1].append(s)
> halves[s2].append(s)
>
> for k in halves.iterkeys():
> for seq1,seq2 in combinations(halves[k],2):
> # check for one-change before expensive alignment?
> align(seq1,seq2)
>
> - n
>
> On 2/7/2012 8:50 PM, Eric Talevich wrote:
>> On Tue, Feb 7, 2012 at 8:01 PM, George Devaniranjan
>> <devaniranjan at gmail.com>wrote:
>>
>>> Hi,
>>>
>>> I have a list of> 200, 000 UNIQUE short EQUAL length sequences.
>>> I do the following
>>>
>>> I am comparing ALL sequences against ALL sequences so there will be
>>> (200000
>>> * 199999 )/2 comparisons
>>> Once a sequence is compared, if they differ from one another by ONE
>>> letter
>>> only . then I do another more detailed alignment using a BLOSUM matrix.
>>>
>>> Currently I use the pairwise sequence comparison code found in BIOPYTHON
>>> for both comparison, simple comparison where I set
>>> match = 0
>>> mismatch = -1
>>> If the total alignment score is equal to -1 (meaning only one mismatch)
>>> then I go a further step and do a BLOSUM alignment.
>>>
>>> This works but its taking a long long time, I suspect its because I am
>>> using TWO alignments but I think there could be a way to do the first
>>> simple alignment WITHOUT using the pairwise alignment code for the first
>>> part will speed up this calculation.
>>> Unfortunately I don't have much more than a desktop to do this, so if
>>> someone can suggest a quicker way to do this, I would appreciate it.
>>>
>>> Thank you,
>>> George
>>>
>>>
>> Hi George,
>>
>> If your sequences are all equal length, and you're interested in the ones
>> that differ by 1 character, then the difference between any two of those
>> sequences of interest will be a single mismatched character. You don't
>> need
>> to do an alignment at all.
>>
>> Without Python: try clustering at whatever identify threshold corresponds
>> to edit distance 1 in your sequences. UCLUST/USEARCH and other
>> programs can
>> do this quickly.
>>
>> With Python: try an expression like:
>>
>> seq_pairs_of_interest = []
>> for i, aseq in input_seq_list[:-1]:
>> for j, bseq in input_seq_list[i+1:]:
>> if sum(a != b for a, b in zip(aseq, bseq)) == 1:
>> seq_pairs_of_interest.append((aseq, bseq))
>>
>>
>> Hope that helps,
>> Eric
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
>

-- 
Dr. Nathan Edwards                      nje5 at georgetown.edu
Department of Biochemistry and Molecular & Cellular Biology
            Georgetown University Medical Center
Room 1215, Harris Building          Room 347, Basic Science
3300 Whitehaven St, NW              3900 Reservoir Road, NW
Washington DC 20007                     Washington DC 20007
Phone: 202-687-7042                     Phone: 202-687-1618
Fax: 202-687-0057                         Fax: 202-687-7186