what about the speed on longer seq? Re: [Bioperl-l] regular expression help!

James D. White jdw at ou.edu
Mon Jan 24 14:48:40 EST 2005


Your original regex would have required O(n**4) attempts to match, 
because the unanchored starting position, "\S+", "(\S+)", and ".*" each 
involve O(n) possibilities.

The unanchored starting position and each occurrence of ".*", "\S+", or 
a repeat range with no upper limit (e.g., "\S{4,}") can multiply by O(n) 
possible matches to be tested.

My regex is O(n**3) for the unanchored starting position, "(\S+)", and 
".*".  Adding the minimum 4 bases for the first repeat, my regex becomes:

=~ m:(\S{4,})(\S{10}).*?(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; 
reverse($rev);})\1:i;

Your new regex is also O(n**4) for the unanchored starting position, 
"(\S{4,})", "(\S{10,})", and ".+", but the new regex does recognize 
longer inverted repeats.  The initial uncaptured \S+ is no longer there, 
but "(\S{10,})" which matches longer inverted repeats has replaced it in 
bringing the number of matches to be tested back up.  I might change
".*" to ".*?" to reduce the need for backtracking, resulting in:

=~ /(\S{4,})(\S{10,}).+?(??{sub($2)})\1/i;

but this is still O(n**4).

If my version is modified to find longer inverted repeats by using 
(\S{10,}), then it would also become O(n**4), but, I suspect that not 
calling the sub() avoids some extra overhead not present in your 
version, but I have not tried to examine any generated code nor run any 
tests to be sure.

Knowledge of upper limits for each repeat length can greatly reduce the 
number of unproductive matches by changing "*" to "{0,max}", "+" to 
"{1,max}", and "{min,}" to "{min,max}".  But I do not know if any 
reasonable limits are known for your data.

In order to bring the order back down to O(n**3) and still find the 
longer inverted repeats, let's break the problem up into finding the 
original 10 base inverted repeat and then extending it.  Using \1 and \2 
to represent the repeated substrings and revcomp() as the reverse 
complement of its argument, the matched sequence is "\1\2.*revcomp(\2)\1".

If the full \2 is longer than the minimum 10 bases, let's call the first 
10 bases \2a and the rest \2b.  The matched sequence is now 
"\1\2a\2b.*revcomp(\2b)revcomp(\2a)\1".  Now searching for only a 10 
base inverted repeat simplifies to "\1\2a.*revcomp(\2a)\1", which is an 
O(n**3) operation using my regex.  Extending the inverted repeat is 
O(n), but the combined process is not O(n**3)*O(n), but O(n**3)+O(n) 
which is still O(n**3).

Unfortunately the same process does not work for \1, because 
"\1a\1b\2.*revcomp(\2)\1a\1b" becomes either "\1a.*\2.*revcomp(\2)\1a" 
or "\1b\2.*revcomp(\2).*\1b", which is still O(n**4) in either case.

So the best I can come up with is (not tested):

# add parens to capture .*?
$string =~ m:(\S{4,})(\S{10})(.*?)(??{$rev = $2; $rev =~ 
tr/ATCGatcg/TAGCtagc/; reverse($rev);})\1:i;
# Save repeats and middle.  You can also look at @- and @+
# to find the positions of the matching substrings.
$direct = $1;
$inverted = $2;
$middle = $3;
# find the length of the extension for the inverted repeat
$low = 0;
$high = length($middle) - 1;
while ($low < $high) {
	$rc = lc(substr($middle, $high, 1));
	$rc =~ tr/atcg/tagc/;
	last if lc(substr($middle, $low, 1)) ne $rc;
	$low++;
	$high--;
	}
# extend the repeat, if necessary
if ($low) {
	$inverted .= substr($middle, 0, $low, '');
	substr($middle, -$low) = '';
	}

I hope this is helpful.  If the process is still too slow, then you can 
extend these ideas by finding fixed length direct and inverted repeats 
separately using O(n**2) regexes:

=~ m/(\S{4})(.*?)\1/i;		# to find direct repeats

=~ m:(\S{10})(.*?)(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/; 
reverse($rev);}):i;		# to find inverted repeats

Then extend them using techniques similar to the above.  Once you have 
the separate direct and inverted repeat locations, then you have to 
match neighboring repeats to find what you want.  This technique is more 
general and allows you, for example, to find sequences which have a few 
mismatched bases between the repeat pairs without exploding the 
complexity of the search.  There is however some increase in programming 
effort.

Good luck,

Jim White

Guojun Yang wrote:
> Thank you James for your detailed info. An earlier solution given is to use 
> =~ /(\S{4,})(\S{10,}).+(??{sub($2)})\1/i; the sub is to do the transliteration and reversion of $2. It works greatly on ~80 bp seq. However, on a seq ~500 bp, it takes forever to do. Is there any similarity in processing time for the regex? I will definitely try it.
> Have a great one,
> Yang
> ----- Original Message -----
> From: James D. White <jdw at ou.edu>
> To: bioperl-l at portal.open-bio.org
> Sent: Fri, 21 Jan 2005 11:54:37 -0500
> Subject: Re: [Bioperl-l] regular expression help!
> 
> 
> 
>>Sorry about double posting, but I forgot to change the subject before
>>sending the first message.
>>
>>
>>>Starting with:
>>>
>>>$regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~ tr/ATCG/TAGC/i);})\1.*/i;
>>>
>>>The slashes in tr/// confused the Perl parser.  You need to use
>>>different delimiters for the m// operator (the m is implied by //)
>>>and the tr/// operator.  Also the tr/// operator does not use the
>>>i flag, so lower case needs to be handled explicitly.  So let's
>>>try the following:
>>>
>>>$regex =~ m:\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~
>>
>>tr/ATCGatcg/TAGCtagc/);})\1.*:i;
>>
>>>This gives the error:
>>>Can't modify constant item in transliteration (tr///) at (re_eval 1)
>>>line 1, near "tr/ATCGatcg/TAGCtagc/)"
>>>
>>>Inside the (??{ CODE }) sequence, use $1, $2, ..., instead of
>>>\1, \2, ... (See Programming Perl, 3rd Edition, "Match-time pattern
>>>interpolation", p. 213) Inside the evaluated CODE, \2 is a
>>>constant, not the value of the second captured substring.  Also I'm
>>>not sure what modifying $2 would do, so let's try:
>>>
>>>$regex =~ m:\S+(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/;
>>
>>reverse($rev);})\1.*:i;
>>
>>>This works, but I would get rid of the leading "\S+" and trailing
>>>".*".  The ".*" adds nothing useful, so just drop it.  You
>>>probably don't need the leading "\S+", because the pattern is not
>>>anchored to the beginning of the string with "^".  The leading
>>>"\S+" gobbles up the entire string, forcing the match to backtrack
>>>character by character from the end.  It also forces the substring
>>>match saved in $1 to occur after the first character.  Unless you
>>>never want $1 to consider the first character, just drop the
>>>leading "\S+".  If you don't want to search the first character,
>>>then just use "\S".  This results in:
>>>
>>>$regex =~ m:(\S+)(\S{10}).*(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/;
>>
>>reverse($rev);})\1:i;
>>
>>>Finally I would probably change the remaining ".*" to ".*?".  If
>>>you search with ".*" on a long sequence which could contain
>>>multiple sequences of interest, the ".*" pattern will match the rest
>>>of the sequence and force backtracking to match the first occurrence
>>>of "$1$2" with the last occurrence of "revcomp($2)$1".  If you use
>>>".*?", you match the first occurrence of "$1$2" with the nearest
>>>occurrence of "revcomp($2)$1".  This results in the final regular
>>>expression:
>>>
>>>$regex =~ m:(\S+)(\S{10}).*?(??{$rev = $2; $rev =~ tr/ATCGatcg/TAGCtagc/;
>>
>>reverse($rev);})\1:i;
>>
>>>>Date: Fri, 14 Jan 2005 14:12:46 -0500
>>>>From: Guojun Yang <gyang at plantbio.uga.edu>
>>>>Subject: [Bioperl-l] regular expression help!
>>>>To: bioperl-l at portal.open-bio.org
>>>>Message-ID: <20050114141246.94c7cb46 at dogwood.plantbio.uga.edu>
>>>>Content-Type: text/plain;       charset="us-ascii"
>>>>
>>>>Hi, Everybody,
>>>>I was trying to use a regex recognizing a patter of inverted repeat DNA seq
>>
>>flanked by direct repeats (see below), it returns errors saying "(?{...}) not
>>terminated or {...} not balanced. Can anybody help me sorting this out?
>>
>>>>The regex I have is:
>>>>$regex =~ /\S+(\S+)(\S{10}).*(??{$rev=reverse(\2 =~
>>
>>tr/ATCG/TAGC/i);})\1.*/i;
>>
>>>>Thank you,
>>>>Yang
>>>>
>>>
>>>--
>>>James D. White   (jdw at ou.edu)
>>>Director of Bioinformatics
>>>Department of Chemistry and Biochemistry/ACGT
>>>University of Oklahoma
>>>101 David L. Boren Blvd., SRTC 2100
>>>Norman, OK 73019
>>>Phone: (405) 325-4912, FAX: (405) 325-7762
>>
>>--
>>James D. White   (jdw at ou.edu)
>>Director of Bioinformatics
>>Department of Chemistry and Biochemistry/ACGT
>>University of Oklahoma
>>101 David L. Boren Blvd., SRTC 2100
>>Norman, OK 73019
>>Phone: (405) 325-4912, FAX: (405) 325-7762
>>
>>
>>
>>_______________________________________________
>>Bioperl-l mailing list
>>Bioperl-l at portal.open-bio.org
>>http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>
> 
> 






More information about the Bioperl-l mailing list