[Bioperl-l] Finding possible primers regex
Chris Fields
cjfields at uiuc.edu
Mon Aug 4 00:08:51 EDT 2008
On Aug 2, 2008, at 3:05 PM, Benbo wrote:
>
> Hi there,
> I'm trying to write a perl script to scan an aligned multiple entry
> fasta
> file and find possible primers. So far I've produced a string which
> contains
> bases which match all sequences and * where they don't match e.g.
> 1) TTAGCCTAA
> 2) TTAGCAGAA
> 3) TTACCCTAA
>
> would give TTA*C**AA.
>
> I want to parse this string and pull out all sequences which are
> 18-21 bp in
> length and have no more than 4 * in them.
>
> So far, I've got this:
>
> while($fragment_match =~ /([GTAC*]{18,21})/g){
> print "$1\n";
> }
>
> hoping to match all fragments 18-21 characters in length. However
> even that
> doesn't work as it has essentially chunked it into 21 char blocks,
> rather
> than what I hoped for of
> 0-18
> 0-19
> 0-20
> 0-21
> 1-19
> 1-20
> 1-21
> 1-22
>
> etc.
>
> Can anyone let me know if this is already possible in BioPerl, or
> how one
> would go about it with regex. Sadly I'm fairly new to perl and
> getting to
> grips with BioPerl, so please treat me gently :).
>
> Many thanks,
>
> Ben
There is a trick to this which is discussed more extensively in
'Mastering Regular Expressions'. Essentially you have to embed code
into the regex and trick the parser into backtracking using a negative
lookahead. The match itself fails (i.e. no match is returned), but
the embedded code is executed for each match attempt,
The following script is a slight modification of one I used which
checks the consensus string from the input alignment (in aligned FASTA
format here), extracts the alignment slice using that match, then spit
the alignment out to STDOUT in clustalw format. This should work for
perl 5.8 and up, but it's only been tested on perl 5.10. You should
be able to use this to fit what you want.
my $in = Bio::AlignIO->new(-file => $file,
-format => 'fasta');
my $out = Bio::AlignIO->new(-fh => \*STDOUT,
-format => 'clustalw');
while (my $aln = $in->next_aln) {
my $c = $aln->consensus_string(100);
my @matches;
$c =~ m/
([GTAC?]{18,21})
(?{my $match = check_match($1);
push @matches, [$match,
pos(),
length($match)]
if defined $match;})
(?!)
/xig;
for my $match (@matches) {
my ($hit, $st, $end) = ($match->[0],
$match->[1] - $match->[2] + 1,
$match->[1]);
my $newaln = $aln->slice($st, $end);
$out->write_aln($newaln);
}
}
sub check_match {
my $match = shift;
return unless $match;
my $ct = $match =~ tr/?/?/;
return $match if $ct <= 4;
}
chris
More information about the Bioperl-l
mailing list