[Bioperl-l] Sequence matching problem!

Albert Vilella avilella at gmail.com
Fri Feb 23 09:59:49 UTC 2007


Now that we are on this pattern matching thread, I was wondering if
any Perl guru could enlighten me on matching an exact sequence
pattern against a gapped target sequence. E.g.:

my $seq = "CGATCAACGAATCGTACGTACTC";
my $gapped_seq =
"GGGGGGCG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTCGGGGG";

and one would like to get as a result:

"CG-------A---TC---AACGA-----ATC---GTA---CGTACTC"

which is the match of $seq but in $gapped_seq.
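
One possible approach, sketched here under assumptions rather than as an established BioPerl idiom: build a regex from $seq that allows zero or more gap characters between consecutive residues. The helper name match_gapped is made up for illustration, and the sketch assumes '-' is the only gap character.

```perl
use strict;
use warnings;

# Find $seq inside $gapped, allowing any number of '-' gap characters
# between consecutive residues. Returns the gapped substring, or undef.
# (match_gapped is a hypothetical helper name, not a BioPerl function.)
sub match_gapped {
    my ($seq, $gapped) = @_;

    # Quote each residue and join them with '-*' (zero or more gaps).
    my $pattern = join '-*', map { quotemeta($_) } split //, $seq;

    return $gapped =~ /($pattern)/i ? $1 : undef;
}

my $seq = "CGATCAACGAATCGTACGTACTC";
my $gapped_seq =
    "GGGGGGCG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTCGGGGG";

my $hit = match_gapped($seq, $gapped_seq);
print defined $hit ? "$hit\n" : "no match\n";
```

The returned string starts at the first residue of $seq and ends at its last residue, gaps included. Gaps before the first or after the last residue are not part of the match.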

Cheers,

    Albert.


On 2/23/07, Heikki Lehvaslaiho <heikki at sanbi.ac.za> wrote:
> Kurt,
>
> There are a few things to note in your code:
>
> - regexp /C*T/ matches any T preceded by zero or more Cs,
>   not what you meant
> - @- and @+ (your $-[0] and $+[0]) are special variables, not
>   functions, and you do not need them here. The "expensive" match
>   variables are $&, $` and $': on older perls, using them even once
>   in your code slows execution down considerably. There is always
>   another way.
> - Keep in mind what you want to use the match positions for:
>   Human readable locations usually start counting with 1 but
>   perl code uses 0 as the first location. The code below assumes
>   you want to print the locations out.
>
> Study my example code below.
>
> Yours,
>         -Heikki
>
> ###################################################################
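> #!/usr/bin/perl
> use strict;
> use warnings;
>
> my $seq = "GATCAAT";
> # my $pattern = 'C*T';   # zero or more Cs, then T: matches the first T
> my $pattern = 'C.*T';    # C, then any characters, then T (greedy)
>
> while ($seq =~ m/($pattern)/gi) {
>
>     my $match = $1;
>     my $end   = pos($seq);                  # offset just past the match
>     my $start = $end - length($match) + 1;  # 1-based start position
>
>     print "$match : $start - $end\n";
> }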
>
> ###################################################################
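
To make the difference between the patterns concrete, here is a minimal sketch that reports the first match of TC, C*T and C.*T in GATCAAT, with both 0-based and 1-based positions. It uses pos() and length() as in the code above; the helper name first_match is made up for illustration.

```perl
use strict;
use warnings;

# Return (matched text, 0-based start, 1-based start, 1-based end) for the
# first match of $pattern in $seq, or an empty list if nothing matches.
# (first_match is a hypothetical helper name used only in this sketch.)
sub first_match {
    my ($seq, $pattern) = @_;

    # A scalar-context match with /g sets pos($seq) on success.
    return unless $seq =~ m/($pattern)/gi;

    my $end    = pos($seq);            # 0-based offset just past the match
    my $start0 = $end - length($1);    # 0-based start of the match

    return ($1, $start0, $start0 + 1, $end);
}

for my $pattern ('TC', 'C*T', 'C.*T') {
    my ($text, $start0, $start1, $end1) = first_match('GATCAAT', $pattern);
    print "$pattern: '$text' starts at $start0 (0-based), ",
          "spans $start1 - $end1 (1-based)\n";
}
```

Because the offset just past a match equals the 1-based position of its last character, the same pos() value serves as the 1-based end.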
>
>
> On Thursday 22 February 2007 22:41:37 Kurt Gobain wrote:
> > Hi everyone,
> > I am having a lot of trouble with simple pattern matching between a
> > sequence and a pattern. The program should be able to do two things:
> > 1) Normal matching. For example, with GATCAAT, if TC is entered the
> >    output should be 2.
> > 2) Matching using a special character. In the same example, if C*T is
> >    entered it should give 3 as output, and the sequence displayed
> >    should be CAAT.
> > I get the first part easily, but the second part is giving me a
> > problem: I get 1 as output instead of 3. The code is really simple!
> >
> > #!/usr/bin/perl
> > $alphabet = "GATCAAT";
> > $pattern=  "C*T ";
> >
> > $alphabet =~ /($pattern)/i;
> >
> > print "The entire '$pattern' match began at $-[0] and ended at $+[0]\n";
> >
> > ====================
> > OUTPUT!
> > The entire C*T match began at 1 and ended at 2
> > ====================
> >
> > But the output should be 3?
> > And is there any chance I can get the sequence too? I mean, instead of
> > 'C*T' I need 'CAAT'.
> >
> > Well, it is not compulsory to use a regex, but I find it quite simple.
> > Can there be any other method?
> >
> > Thanks in advance!
> > Kurt!
>
>
>
> --
> ______ _/      _/_____________________________________________________
>       _/      _/
>      _/  _/  _/  Heikki Lehvaslaiho    heikki at_sanbi _ac _za
>     _/_/_/_/_/  Associate Professor    skype: heikki_lehvaslaiho
>    _/  _/  _/  SANBI, South African National Bioinformatics Institute
>   _/  _/  _/  University of Western Cape, South Africa
>      _/      Phone: +27 21 959 2096   FAX: +27 21 959 2512
> ___ _/_/_/_/_/________________________________________________________
>


