[Bioperl-l] get regions

Steve Chervitz sac at bioperl.org
Wed May 16 08:16:38 UTC 2007


On 5/15/07, Chris Fields <cjfields at uiuc.edu> wrote:
>
> On May 14, 2007, at 8:46 PM, Steve Chervitz wrote:
> ...
>
> > To generalize your code so that it will work for any pattern, such as
> > one that can match strings of variable length like "A{5,10}", just
> > subtract the length of the actual string that was matched:
> >
> > if ($gene =~ m/$pattern/gi)
> > {
> >     $start = pos($gene) - length($&) + 1;
> >  }
> >
> > Steve
>
> Right, but $& (as well as $` and $') inflict a significant penalty
> for their use, as Aaron alludes to.  Their use, even indirectly via a
> library module, can cause a significant performance hit.
>
> chris

Yes. I had forgotten how poisonous $&, $` and $' were to regex
performance. Please forgive me. We might consider regularly auditing
the bioperl module tree for use of these in committed code.
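
As a rough illustration of what such an audit could look like, here is a minimal sketch. The helper name find_match_vars and the sample text are made up for this example; a real audit would walk the module tree and would need to cope with false positives (for instance a "$" immediately before a quote character inside a string or regex).

```perl
use strict;
use warnings;

# Hypothetical helper: return [line_number, line] pairs for lines
# that appear to mention $&, $` or $'. This is a crude textual check,
# not a parser, so it can report false positives.
sub find_match_vars {
    my ($text) = @_;
    my @found;
    my $n = 0;
    for my $line (split /\n/, $text) {
        $n++;
        next if $line =~ /^\s*#/;    # skip whole-line comments
        push @found, [$n, $line] if $line =~ /\$[&`']/;
    }
    return @found;
}

# Made-up sample input standing in for the contents of a module file.
my $sample = join "\n",
    'my $start = pos($gene) - length($&) + 1;',
    '# $& mentioned only in a comment',
    'my $end = $+[0];';

printf "line %d: %s\n", @$_ for find_match_vars($sample);
```

Only the first sample line should be reported; the comment line is skipped and $+[0] is not one of the penalized variables.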

But regarding the use of the lookahead assertion, there's a problem
if you want to find *all* occurrences of the pattern in a target
string and the pattern can have variable-length hits: it may report
overlapping hits because it only collects the starting point of each
match, and does not determine how long each match would be. For
example:

$gene = 'TTTAAAAAAAAGG';
$pattern="A{5,10}";
while ($gene =~ m/(?=$pattern)/gi) {
    $start = pos($gene) + 1;
    print ++$hit, " hit starts at $start\n";
}

Generates:
1 hit starts at 4
2 hit starts at 5
3 hit starts at 6
4 hit starts at 7

You could get around this by imposing a constraint to avoid trivial
overlaps. That's OK if you know the length of the pattern, but not so
good for more complex patterns. If there were a way to get the lookahead
to match the longest string possible for a variable-length pattern, then
this approach could work, but I'm not sure if that is possible.
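
One possibility, offered only as a sketch: Perl does allow capturing parentheses inside a lookahead, so (?=($pattern)) leaves the greedy match from the current position in $1, and its length can be used to skip overlapping hits. The variable names @hits and $last_end are assumptions for this example, and the added parens carry their own capturing cost.

```perl
use strict;
use warnings;

my $gene    = 'TTTAAAAAAAAGG';
my $pattern = 'A{5,10}';

# Collect [start, end] pairs (1-based, inclusive) for non-overlapping hits.
my @hits;
my $last_end = 0;    # 1-based end of the previously reported hit
while ($gene =~ m/(?=($pattern))/gi) {
    my $start = pos($gene) + 1;
    my $end   = pos($gene) + length($1);    # $1 is the greedy match from here
    next if $start <= $last_end;            # skip hits overlapping the last one
    push @hits, [$start, $end];
    $last_end = $end;
}

printf "%d hit at: %d - %d\n", $_ + 1, @{ $hits[$_] } for 0 .. $#hits;
```

For this target string, the four overlapping start points collapse into a single hit at 4 - 11.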

Here's a solution I think does the job of reporting the extent of each
match without a performance hit and works for patterns of any
complexity, taking advantage of the special arrays containing hit
indexes, @- and @+:

$gene = 'TTTAAAAAAAAGGGGAAAAAAGGGGG';
while ($gene =~ m/$pattern/gi){
    $hit++;
    printf "$hit hit at: %2d - %d\n", $-[0]+1, $+[0];
}

Generates:
1 hit at:  4 - 11
2 hit at: 16 - 21

You can also use this approach to report the locations of any captured
groups, if the pattern contains parentheses, via $-[1], $+[1], $-[2],
$+[2] etc. You'll pay a performance hit when using such patterns, but
patterns not containing parens won't be penalized.
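
To make the group offsets concrete, here is a small sketch. The two-group pattern and the @report array are invented for this illustration; index 0 is the whole match and indexes 1 and 2 are the parenthesized groups.

```perl
use strict;
use warnings;

my $gene    = 'TTTAAAAAAAAGGGGAAAAAAGGGGG';
my $pattern = '(A{5,10})(GG)';    # hypothetical pattern with two groups

my @report;
while ($gene =~ m/$pattern/gi) {
    # $#+ is the index of the last group in the last successful match.
    for my $i (0 .. $#+) {
        next unless defined $-[$i];    # a group may not have participated
        push @report, sprintf "group %d: %d - %d", $i, $-[$i] + 1, $+[$i];
    }
}

print "$_\n" for @report;
```

For the first hit this reports the whole match at 4 - 13, the run of A's at 4 - 11 and the GG at 12 - 13.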

Steve


