[Bioperl-l] Allowing One error in Sequence matching

Abhishek Pratap abhishek.vit at gmail.com
Thu Sep 17 03:12:20 UTC 2009


Thanks Russell.

I think having an "approximate matching" method in BioPerl would help,
especially with NGS data, where matching reads with 1/2/3/4 errors is
sometimes needed.
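
Something along these lines is roughly what I have in mind. This is only a
minimal sketch, not an existing BioPerl method: the names mismatches() and
approx_find() are made up for illustration, it handles substitutions only
(no indels), and it treats N like any other base, so an N counts as a
mismatch unless both sequences have N at that position.

```perl
use strict;
use warnings;

# Count mismatching positions between two equal-length strings.
# String XOR yields "\0" wherever the two characters agree, so the
# number of non-NUL bytes in the result is the Hamming distance.
sub mismatches {
    my ($x, $y) = @_;
    die "lengths differ\n" unless length($x) == length($y);
    my $xor = $x ^ $y;
    return scalar( $xor =~ tr/\0//c );
}

# Return the 0-based offsets where $motif occurs in $seq with at most
# $max mismatches. Hypothetical helper: one comparison per window.
sub approx_find {
    my ($seq, $motif, $max) = @_;
    my $len = length $motif;
    my @hits;
    for my $pos (0 .. length($seq) - $len) {
        push @hits, $pos
            if mismatches(substr($seq, $pos, $len), $motif) <= $max;
    }
    return @hits;
}

# Example call: offsets of AGCT in a short test string, allowing 1 error.
print join(",", approx_find("TTAGCTTTACCTTT", "AGCT", 1)), "\n";
```

The cost grows with (sequence length x motif length), so it should be fine
for short reads, but for large NGS datasets a dedicated aligner would
probably be the better choice.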

Cheers,
-Abhi



On Wed, Sep 16, 2009 at 9:46 PM, Smithies, Russell
<Russell.Smithies at agresearch.co.nz> wrote:
> I misread your question. My example will match NGCT, ANCT, AGNT, or AGCN with 1 mismatch (or NGNT, NGCN, ANNT, ANCN etc. with 2 mismatches).
> The eval is just doing a regex match against the pattern string created by the loop - "[AN][GN][CN][TN]"
> If your word size is short and you're not allowing too many mismatches, brute-forcing it with a compiled regex would probably work.
>
>
>> -----Original Message-----
>> From: Abhishek Pratap [mailto:abhishek.vit at gmail.com]
>> Sent: Thursday, 17 September 2009 1:39 p.m.
>> To: Smithies, Russell
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] Allowing One error in Sequence matching
>>
>> Hi Russell
>>
>> Thanks for the quick reply. However, I am not following the code
>> clearly, or the reasoning behind it.
>>
>> Will this work for matching AGCT to ACCT | ANCT | AACT? It didn't give
>> me the expected output when I ran it. I am more interested in
>> understanding the logic.
>>
>> It would be great if you could expand a bit more.
>>
>>
>> Also, if I do it the brute-force way suggested to me by a friend, how
>> will that scale?
>>
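>> my @dna1 = split //, $a;   # query bases
>> my @dna2 = split //, $b;   # reference bases (assumed same length)
>> my $x = 0;                 # number of mismatching positions
>> for (my $i = 0; $i < @dna1; $i++) {
>>     $x++ if $dna1[$i] ne $dna2[$i];
>> }
>>
>> # Accept the pair if at most one position differs.
>> if ($x <= 1) {
>>     print "RESULT: your sequence is true\n";
>> }
>> else {
>>     print "RESULT: your sequence is false\n";
>> }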
>>
>> Thanks,
>> -Abhi
>>
>>
>> On Wed, Sep 16, 2009 at 7:06 PM, Smithies, Russell
>> <Russell.Smithies at agresearch.co.nz> wrote:
>> > How about chunking it into overlapping words, skipping any word with 2 or more Ns, then using a regex?
>> >
>> > $seq = "CGATCGNATGNCGTCTAGCTGACANGTTGACTCTAGCTGATCGATCGATCGTACGTANNCGTAGTCGTACNTACGATCTNACGCACGNATGCTACGTACG";
>> >
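>> > $motif = "ACGT";
>> > # Build a pattern that accepts the motif base or N at each position.
>> > foreach (split //, $motif) {$w .= "[${_}N]"}
>> >
>> > # The lookahead captures every overlapping 4-base word in the sequence.
>> > foreach ($seq =~ /(?=(\w{4}))/g){
>> >  next if tr/N/N/ >= 2;   # skip words containing two or more Ns
>> >  print "$_\n" if /$w/;
>> > }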
>> >
>> >
>> >
>> >> -----Original Message-----
>> >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> >> bounces at lists.open-bio.org] On Behalf Of Abhishek Pratap
>> >> Sent: Thursday, 17 September 2009 9:42 a.m.
>> >> To: bioperl-l at lists.open-bio.org
>> >> Subject: [Bioperl-l] Allowing One error in Sequence matching
>> >>
>> >> Hi All
>> >>
>> >> I am not able to think of a smart way to do sequence matching allowing
>> >> a user-defined number of mismatches.
>> >>
>> >> For example:
>> >>
>> >> Given sequence: AGCT will be considered a match to the reference if any
>> >> one base position (1,2,3,4) has a mismatch, that is, any of [ACGTN], so
>> >> the possible matches could be
>> >>
>> >> This is for position 1.
>> >> AGCT
>> >> GGCT
>> >> CGCT
>> >> TGCT
>> >> NGCT
>> >> and likewise for each position.
>> >>
>> >> Is there any nice regular expression for this? One way I could think
>> >> of was to generate all the possible tags for a given sequence and then
>> >> do the matching. It will be computationally expensive for a long
>> >> dataset. Any neat method?
>> >>
>> >> Thanks,
>> >> -Abhi
>> >> _______________________________________________
>> >> Bioperl-l mailing list
>> >> Bioperl-l at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >
>



