[Bioperl-l] get regions

Chris Fields cjfields at uiuc.edu
Wed May 16 16:02:28 UTC 2007


On May 16, 2007, at 3:16 AM, Steve Chervitz wrote:
...

>>
>> Right, but $& (as well as $` and $') inflict a significant penalty
>> for their use, as Aaron alludes to.  Their use, even indirectly via a
>> library module, can cause a significant performance hit.
>>
>> chris
>
> Yes. I had forgotten how poisonous $&, $` and $' were to regex
> performance. Please forgive me. We might consider regularly auditing
> the bioperl module tree for use of these in committed code.

Already done!  We have run a few audits for gotchas like that:

http://www.bioperl.org/wiki/Auditing

http://www.bioperl.org/wiki/Bioperl_Best_Practices

If there is anything we should be looking for please feel free to add  
as needed.  There shouldn't be any use of the 'naughty' variables in  
CVS, but it might be worth a second look...
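
For anyone who wants to repeat that check locally, here is a rough sketch of
such a scan. It is not the script behind the wiki audits: the sub name
find_match_vars is made up for illustration, and the comment stripping is
naive, so expect false positives (for example a single-quoted string that
ends in a dollar sign).

```perl
use strict;
use warnings;

# Return [line_number, line_text] pairs for lines that appear to use the
# match variables. Crude heuristic: strings and regexes are not parsed.
sub find_match_vars {
    my ($source) = @_;
    my @found;
    my $n = 0;
    for my $line (split /\n/, $source) {
        $n++;
        (my $code = $line) =~ s/(?:^|\s)#.*//;    # drop comments (naive)
        push @found, [$n, $line] if $code =~ /\$[&`']/;
    }
    return @found;
}

# Usage (file names are up to you): perl audit_match_vars.pl Some/Module.pm
for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "Cannot read $file: $!\n"; next };
    my $source = do { local $/; <$fh> };
    close $fh;
    printf "%s:%d: %s\n", $file, @$_ for find_match_vars($source);
}
```

Anything it reports still needs a human look before being called a real use.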

> But regarding the use of the look ahead assertion, there's a problem
> if you want to find *all* occurrences of the pattern in a target
> string and the pattern can have variable length hits: it may report
> overlapping hits because it only collects the starting points of the
> match, and does not determine how long each match would be. For
> example:
>
> $gene = 'TTTAAAAAAAAGG';
> $pattern="A{5,10}";
> while ($gene =~ m/(?=$pattern)/gi) {
>     $start = pos($gene) + 1;
>     print ++$hit, " hit starts at $start\n";
> }
>
> Generates:
> 1 hit starts at 4
> 2 hit starts at 5
> 3 hit starts at 6
> 4 hit starts at 7
>
> You could get around this by imposing a constraint to avoid trivial
> overlaps. OK if you know the length of the pattern, but not so good
> for more complex patterns. If there was a way to get the look ahead to
> match the longest string possible for a variable length pattern, then
> this approach could work, but I'm not sure if that is possible.
>
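
One possible way to get the extent out of the look ahead is to put a
capturing group inside it and read $-[1] and $+[1]. The sketch below does
that and then moves pos() past each hit so trivial overlaps are skipped. The
sub name lookahead_hits is hypothetical, and "longest" here only means
whatever the regex engine prefers at that start position (greedy for a
quantifier like {5,10}, not necessarily the longest branch of an
alternation).

```perl
use strict;
use warnings;

# Return [start, end] pairs (1-based, inclusive) for non-overlapping hits,
# using a capturing group inside a zero-width look ahead.
sub lookahead_hits {
    my ($seq, $pattern) = @_;
    my @hits;
    while ($seq =~ m/(?=($pattern))/gi) {
        my ($from, $to) = ($-[1], $+[1]);   # 0-based offsets of group 1
        next if $to == $from;               # empty hit: let /g advance itself
        push @hits, [ $from + 1, $to ];
        pos($seq) = $to;                    # resume the search after this hit
    }
    return @hits;
}

my $gene = 'TTTAAAAAAAAGGGGAAAAAAGGGGG';
printf "hit at: %2d - %d\n", @$_ for lookahead_hits($gene, 'A{5,10}');
# hit at:  4 - 11
# hit at: 16 - 21
```
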
> Here's a solution I think does the job of reporting the extent of each
> match without a performance hit and works for patterns of any
> complexity, taking advantage of the special arrays containing hit
> indexes, @- and @+:
>
> $gene = 'TTTAAAAAAAAGGGGAAAAAAGGGGG';
> while ($gene =~ m/$pattern/gi){
>     $hit++;
>     printf "$hit hit at: %2d - %d\n", $-[0]+1, $+[0];
> }
>
> Generates:
> 1 hit at:  4 - 11
> 2 hit at: 16 - 21
>
> You can also use this approach to report the locations of any
> capture groups, if the pattern contains any parentheses, via $-[1],
> $+[1], $-[2], $+[2] etc. You'll pay a performance hit when using such
> patterns, but patterns not containing parens won't be penalized.
>
> Steve
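
To illustrate the point about parenthesized groups, here is a small sketch
that collects the offsets for the whole match and for each group. The
two-group pattern and the sub name group_spans are invented for the example;
$#+ gives the index of the last group in the most recent successful match.

```perl
use strict;
use warnings;

# For each hit return a list of [start, end] pairs (1-based, inclusive):
# index 0 is the whole match, index N is capture group N.
sub group_spans {
    my ($seq, $re) = @_;
    my @hits;
    while ($seq =~ m/$re/g) {
        my @spans;
        for my $i (0 .. $#+) {
            # A group that took no part in the match has undefined offsets.
            push @spans, defined $-[$i] ? [ $-[$i] + 1, $+[$i] ] : undef;
        }
        push @hits, \@spans;
    }
    return @hits;
}

my $gene = 'TTTAAAAAAAAGGGGAAAAAAGGGGG';
my $n    = 0;
for my $hit (group_spans($gene, qr/(A{5,10})(G+)/i)) {
    $n++;
    for my $i (0 .. $#$hit) {
        next unless defined $hit->[$i];
        my $label = $i ? "group $i" : 'match';
        printf "%d %-7s %2d - %d\n", $n, $label, @{ $hit->[$i] };
    }
}
```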

Friedl's Regex book has outlined a few ways to get around the  
'naughty' variables $`, $&, and $' using substr() and $-[0], $+[0],  
or both, which makes sense since @+ and @- are arrays of positions  
instead of actual text.

$`  substr(target, 0, $-[0])
$&  substr(target, $-[0], $+[0] - $-[0])
$'  substr(target, $+[0])
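
As a quick check of those three equivalents, something along these lines
should do. The sub name match_parts is just for illustration, and the
sequence is borrowed from Steve's example above.

```perl
use strict;
use warnings;

# Return (prematch, match, postmatch) for the first hit of $re in $target,
# built from @- and @+ instead of the three match variables.
sub match_parts {
    my ($target, $re) = @_;
    return unless $target =~ $re;
    return (
        substr($target, 0, $-[0]),                # text before the match
        substr($target, $-[0], $+[0] - $-[0]),    # the matched text
        substr($target, $+[0]),                   # text after the match
    );
}

my ($pre, $match, $post) = match_parts('TTTAAAAAAAAGG', qr/A{5,10}/i);
print "pre=$pre match=$match post=$post\n";   # pre=TTT match=AAAAAAAA post=GG
```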

Wonderful book!

chris


