[Bioperl-l] Finding possible primers regex
Hilmar Lapp
hlapp at gmx.net
Sat Aug 9 12:07:30 EDT 2008
This looks like a neat trick. Do you think it's worth including as a
SimpleAlign method (obviously w/o the printing to STDOUT)? I can
imagine that a lot of people might appreciate it.
-hilmar
On Aug 4, 2008, at 12:08 AM, Chris Fields wrote:
> On Aug 2, 2008, at 3:05 PM, Benbo wrote:
>
>>
>> Hi there,
>> I'm trying to write a perl script to scan an aligned multiple entry
>> fasta
>> file and find possible primers. So far I've produced a string which
>> contains
>> bases which match all sequences and * where they don't match e.g.
>> 1) TTAGCCTAA
>> 2) TTAGCAGAA
>> 3) TTACCCTAA
>>
>> would give TTA*C**AA.
>>
>> I want to parse this string and pull out all sequences which are
>> 18-21 bp in
>> length and have no more than 4 * in them.
>>
>> So far, I've got this:
>>
>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>> print "$1\n";
>> }
>>
>> hoping to match all fragments 18-21 characters in length. However
>> even that
>> doesn't work as it has essentially chunked it into 21 char blocks,
>> rather
>> than what I hoped for of
>> 0-18
>> 0-19
>> 0-20
>> 0-21
>> 1-19
>> 1-20
>> 1-21
>> 1-22
>>
>> etc.
>>
>> Can anyone let me know if this is already possible in BioPerl, or
>> how one
>> would go about it with regex. Sadly I'm fairly new to perl and
>> getting to
>> grips with BioPerl, so please treat me gently :).
>>
>> Many thanks,
>>
>> Ben
>
> There is a trick to this which is discussed more extensively in
> 'Mastering Regular Expressions'. Essentially you have to embed code
> into the regex and trick the parser into backtracking using a
> negative lookahead. The match itself fails (i.e. no match is
> returned), but the embedded code is executed for each match attempt,
>
> The following script is a slight modification of one I used which
> checks the consensus string from the input alignment (in aligned
> FASTA format here), extracts the alignment slice using that match,
> then spit the alignment out to STDOUT in clustalw format. This
> should work for perl 5.8 and up, but it's only been tested on perl
> 5.10. You should be able to use this to fit what you want.
>
> my $in = Bio::AlignIO->new(-file => $file,
> -format => 'fasta');
> my $out = Bio::AlignIO->new(-fh => \*STDOUT,
> -format => 'clustalw');
>
> while (my $aln = $in->next_aln) {
> my $c = $aln->consensus_string(100);
> my @matches;
> $c =~ m/
> ([GTAC?]{18,21})
> (?{my $match = check_match($1);
> push @matches, [$match,
> pos(),
> length($match)]
> if defined $match;})
> (?!)
> /xig;
> for my $match (@matches) {
> my ($hit, $st, $end) = ($match->[0],
> $match->[1] - $match->[2] + 1,
> $match->[1]);
> my $newaln = $aln->slice($st, $end);
> $out->write_aln($newaln);
> }
> }
>
> sub check_match {
> my $match = shift;
> return unless $match;
> my $ct = $match =~ tr/?/?/;
> return $match if $ct <= 4;
> }
>
>
> chris
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
More information about the Bioperl-l
mailing list