[Bioperl-l] Finding possible primers regex

Sat Aug 9 12:07:30 EDT 2008

This looks like a neat trick. Do you think it's worth including as a  
SimpleAlign method (obviously w/o the printing to STDOUT)? I can  
imagine that a lot of people might appreciate it.

	-hilmar

On Aug 4, 2008, at 12:08 AM, Chris Fields wrote:

> On Aug 2, 2008, at 3:05 PM, Benbo wrote:
>
>>
>> Hi there,
>> I'm trying to write a perl script to scan an aligned multiple entry  
>> fasta
>> file and find possible primers. So far I've produced a string which  
>> contains
>> bases which match all sequences and * where they don't match e.g.
>> 1) TTAGCCTAA
>> 2) TTAGCAGAA
>> 3) TTACCCTAA
>>
>> would give TTA*C**AA.
>>
>> I want to parse this string and pull out all sequences which are  
>> 18-21 bp in
>> length and have no more than 4 * in them.
>>
>> So far, I've got this:
>>
>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>> print "$1\n";
>> }
>>
>> hoping to match all fragments 18-21 characters in length. However  
>> even that
>> doesn't work as it has essentially chunked it into 21 char blocks,  
>> rather
>> than what I hoped for of
>> 0-18
>> 0-19
>> 0-20
>> 0-21
>> 1-19
>> 1-20
>> 1-21
>> 1-22
>>
>> etc.
>>
>> Can anyone let me know if this is already possible in BioPerl, or  
>> how one
>> would go about it with regex. Sadly I'm fairly new to perl and  
>> getting to
>> grips with BioPerl, so please treat me gently :).
>>
>> Many thanks,
>>
>> Ben
>
> There is a trick to this which is discussed more extensively in  
> 'Mastering Regular Expressions'.  Essentially you have to embed code  
> into the regex and trick the parser into backtracking using a  
> negative lookahead.  The match itself fails (i.e. no match is  
> returned), but the embedded code is executed for each match attempt,
>
> The following script is a slight modification of one I used which  
> checks the consensus string from the input alignment (in aligned  
> FASTA format here), extracts the alignment slice using that match,  
> then spit the alignment out to STDOUT in clustalw format.  This  
> should work for perl 5.8 and up, but it's only been tested on perl  
> 5.10.  You should be able to use this to fit what you want.
>
> my $in = Bio::AlignIO->new(-file => $file,
>                           -format => 'fasta');
> my $out = Bio::AlignIO->new(-fh => \*STDOUT,
>                           -format => 'clustalw');
>
> while (my $aln = $in->next_aln) {
>    my $c = $aln->consensus_string(100);
>    my @matches;
>    $c =~ m/
>        ([GTAC?]{18,21})
>        (?{my $match = check_match($1);
>           push @matches, [$match,
>                           pos(),
>                           length($match)]
>              if defined $match;})
>        (?!)
>        /xig;
>    for my $match (@matches) {
>        my ($hit, $st, $end) = ($match->[0],
>                                $match->[1] - $match->[2] + 1,
>                                $match->[1]);
>        my $newaln = $aln->slice($st, $end);
>        $out->write_aln($newaln);
>    }
> }
>
> sub check_match {
>    my $match = shift;
>    return unless $match;
>    my $ct = $match =~ tr/?/?/;
>    return $match if $ct <= 4;
> }
>
>
> chris
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================