what about the speed on longer seq? Re: [Bioperl-l] regular expression help!

Fri Jan 21 22:31:55 EST 2005

Thank you, James, for the detailed explanation. An earlier suggestion was to use
=~ /(\S{4,})(\S{10,}).+(??{sub($2)})\1/i; where sub stands for a subroutine that transliterates (complements) and reverses $2. It works well on sequences of ~80 bp, but on a sequence of ~500 bp it takes forever. Is the processing time of your regex similar? I will definitely try it.
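
For reference, a minimal sketch of the pattern above. The helper name revcomp and the 33-bp test sequence are made up for illustration, since the original sub was not posted. It only shows what the (??{ ... }) block has to return and says nothing about run time on long sequences.

```perl
use strict;
use warnings;

# Hypothetical helper (the name revcomp is an assumption): reverse-complement
# a DNA string, listing both cases because tr/// has no /i flag.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar assignment: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # complement each base
    return $rc;
}

# Made-up sequence: direct repeat, 10-bp arm, spacer, revcomp(arm), direct repeat
my ($flank, $stem) = ('CTTG', 'ACGGTCAGTA');
my $seq = $flank . $stem . 'AAAAA' . revcomp($stem) . $flank;

# In list context the match returns the captured groups ($1, $2).
my ($dr, $arm) = $seq =~ /(\S{4,})(\S{10,}).+(??{ revcomp($2) })\1/i;

if (defined $dr) {
    print "direct repeat: $dr\ninverted repeat arm: $arm\n";
}
```

On this toy sequence the direct repeat comes back as CTTG and the arm as ACGGTCAGTA.
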
Have a great one,
Yang
----- Original Message -----
From: James D. White <jdw at ou.edu>
To: bioperl-l at portal.open-bio.org
Sent: Fri, 21 Jan 2005 11:54:37 -0500
Subject: Re: [Bioperl-l] regular expression help!


> Sorry about double posting, but I forgot to change the subject before
> sending the first message.
> 
> > Starting with:
> >
> > $regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCG/TAGC/i);})\1.*/i;
> >
> > The slashes in tr/// confused the Perl parser.  You need to use
> > different delimiters for the m// operator (the m is implied by //)
> > and the tr/// operator.  Also the tr/// operator does not use the
> > i flag, so lower case needs to be handled explicitly.  So let's
> > try the following:
> >
> > $regex =~ m:\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCGatcg/TAGCtagc/);})\1.*:i;
> >
> > This gives the error:
> > Can't modify constant item in transliteration (tr///) at (re_eval 1)
> > line 1, near "tr/ATCGatcg/TAGCtagc/)"
> >
> > Inside the (??{ CODE }) sequence, use $1, $2, ..., instead of
> > \1, \2, ... (See Programming Perl, 3rd Edition, "Match-time pattern
> > interpolation", p. 213.) Inside the evaluated CODE, \2 is a reference
> > to the constant 2, not the value of the second captured substring.
> > Also, I'm not sure what modifying $2 would do, so let's copy it first:
> >
> > $regex =~ m:\S+(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1.*:i;
> >
> > This works, but I would get rid of the leading "\S+" and trailing
> > ".*".  The trailing ".*" adds nothing useful, so just drop it.  You
> > probably don't need the leading "\S+", because the pattern is not
> > anchored to the beginning of the string with "^".  The leading
> > "\S+" gobbles up the entire string, forcing the match to backtrack
> > character by character from the end.  It also forces the substring
> > match saved in $1 to occur after the first character.  Unless you
> > never want $1 to consider the first character, just drop the
> > leading "\S+".  If you don't want to search the first character,
> > then just use "\S".  This results in:
> >
> > $regex =~ m:(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1:i;
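
To illustrate the point about the leading "\S+", here is a small sketch that runs both versions of the pattern on a made-up sequence. The revcomp helper is hypothetical and stands in for the inline tr///-and-reverse code; the sequence is an assumption chosen so that the effect is visible.

```perl
use strict;
use warnings;

# Hypothetical helper standing in for the inline tr///-and-reverse code.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar assignment: reverses the string
    $rc =~ tr/ATCGatcg/TAGCtagc/;    # complement, both cases
    return $rc;
}

# Made-up sequence: direct repeat, 10-bp arm, spacer, revcomp(arm), direct repeat
my ($flank, $stem) = ('CTTG', 'ACGGTCAGTA');
my $seq = $flank . $stem . 'AAAAA' . revcomp($stem) . $flank;

# With the leading \S+ and trailing .* (the first working version)
my @with    = $seq =~ /\S+(\S+)(\S{10}).*(??{ revcomp($2) })\1.*/i;
# With both dropped
my @without = $seq =~ /(\S+)(\S{10}).*(??{ revcomp($2) })\1/i;

print "with leading \\S+: \$1=$with[0] \$2=$with[1]\n";
print "without:          \$1=$without[0] \$2=$without[1]\n";
```

On this sequence both patterns match, but with the leading "\S+" the greedy prefix swallows most of the direct repeat, so $1 shrinks to "T" and $2 shifts to "GACGGTCAGT". Without it, $1 is "CTTG" and $2 is "ACGGTCAGTA".
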
> >
> > Finally I would probably change the remaining ".*" to ".*?".  If
> > you search with ".*" on a long sequence which could contain
> > multiple sequences of interest, the ".*" pattern will match the rest
> > of the sequence and force backtracking to match the first occurrence
> > of "$1$2" with the last occurrence of "revcomp($2)$1".  If you use
> > ".*?", you match the first occurrence of "$1$2" with the nearest
> > occurrence of "revcomp($2)$1".  This results in the final regular
> > expression:
> >
> > $regex =~ m:(\S+)(\S{10}).*?(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1:i;
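
The difference between ".*" and ".*?" can be sketched on a made-up sequence that contains two copies of "revcomp($2)$1". The revcomp and match_end helpers and the sequence itself are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper standing in for the inline tr///-and-reverse code.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar assignment: reverses the string
    $rc =~ tr/ATCGatcg/TAGCtagc/;    # complement, both cases
    return $rc;
}

# Made-up sequence with two candidate right-hand ends:
# DR + arm + spacer + revcomp(arm) + DR + spacer + revcomp(arm) + DR
my ($flank, $stem) = ('CTTG', 'ACGGTCAGTA');
my $seq = $flank . $stem . 'AAAAA' . revcomp($stem) . $flank
        . 'CCCCC' . revcomp($stem) . $flank;

my $greedy = qr/(\S+)(\S{10}).*(??{ revcomp($2) })\1/i;
my $lazy   = qr/(\S+)(\S{10}).*?(??{ revcomp($2) })\1/i;

# Return the offset just past the end of the match, or undef if no match.
sub match_end {
    my ($string, $re) = @_;
    return $string =~ $re ? $+[0] : undef;
}

printf "greedy ends at %d, non-greedy ends at %d (sequence length %d)\n",
    match_end($seq, $greedy), match_end($seq, $lazy), length $seq;
```

Here the greedy version runs to the last copy (offset 52, the end of the sequence), while the non-greedy version stops at the nearest one (offset 33).
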
> >
> > > Date: Fri, 14 Jan 2005 14:12:46 -0500
> > > From: Guojun Yang <gyang at plantbio.uga.edu>
> > > Subject: [Bioperl-l] regular expression help!
> > > To: bioperl-l at portal.open-bio.org
> > > Message-ID: <20050114141246.94c7cb46 at dogwood.plantbio.uga.edu>
> > > Content-Type: text/plain;       charset="us-ascii"
> > >
> > > Hi, Everybody,
> > > I was trying to use a regex to recognize a pattern of inverted-repeat DNA sequence flanked by direct repeats (see below), but it returns an error saying "(?{...}) not terminated or {...} not balanced". Can anybody help me sort this out?
> > > The regex I have is:
> > > $regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCG/TAGC/i);})\1.*/i;
> > > Thank you,
> > > Yang
> > >
> >
> > --
> > James D. White   (jdw at ou.edu)
> > Director of Bioinformatics
> > Department of Chemistry and Biochemistry/ACGT
> > University of Oklahoma
> > 101 David L. Boren Blvd., SRTC 2100
> > Norman, OK 73019
> > Phone: (405) 325-4912, FAX: (405) 325-7762
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>