[Bioperl-l] Finding possible primers regex
Christopher Fields
cjfields at illinois.edu
Mon Aug 11 15:50:38 EDT 2008
When I can I could try generating a method which accepts a regex/Bio::Tools::SeqPattern and returns an AlignIO stream or array of SimpleAlign instances (the former could be attached to a temp file for iteration). Any preference?
chris
---- Original message ----
>Date: Sat, 9 Aug 2008 12:07:30 -0400
>From: Hilmar Lapp <hlapp at gmx.net>
>Subject: Re: [Bioperl-l] Finding possible primers regex
>To: Chris Fields <cjfields at uiuc.edu>
>Cc: Benbo <btemperton at googlemail.com>, Bioperl-l at lists.open-bio.org
>
>This looks like a neat trick. Do you think it's worth including as a
>SimpleAlign method (obviously w/o the printing to STDOUT)? I can
>imagine that a lot of people might appreciate it.
>
> -hilmar
>
>On Aug 4, 2008, at 12:08 AM, Chris Fields wrote:
>
>> On Aug 2, 2008, at 3:05 PM, Benbo wrote:
>>
>>>
>>> Hi there,
>>> I'm trying to write a perl script to scan an aligned multiple entry
>>> fasta
>>> file and find possible primers. So far I've produced a string which
>>> contains
>>> bases which match all sequences and * where they don't match e.g.
>>> 1) TTAGCCTAA
>>> 2) TTAGCAGAA
>>> 3) TTACCCTAA
>>>
>>> would give TTA*C**AA.
>>>
>>> I want to parse this string and pull out all sequences which are
>>> 18-21 bp in
>>> length and have no more than 4 * in them.
>>>
>>> So far, I've got this:
>>>
>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>> print "$1\n";
>>> }
>>>
>>> hoping to match all fragments 18-21 characters in length. However
>>> even that
>>> doesn't work as it has essentially chunked it into 21 char blocks,
>>> rather
>>> than what I hoped for of
>>> 0-18
>>> 0-19
>>> 0-20
>>> 0-21
>>> 1-19
>>> 1-20
>>> 1-21
>>> 1-22
>>>
>>> etc.
>>>
>>> Can anyone let me know if this is already possible in BioPerl, or
>>> how one
>>> would go about it with regex. Sadly I'm fairly new to perl and
>>> getting to
>>> grips with BioPerl, so please treat me gently :).
>>>
>>> Many thanks,
>>>
>>> Ben
>>
>> There is a trick to this which is discussed more extensively in
>> 'Mastering Regular Expressions'. Essentially you have to embed code
>> into the regex and trick the parser into backtracking using a
>> negative lookahead. The match itself fails (i.e. no match is
>> returned), but the embedded code is executed for each match attempt,
>>
>> The following script is a slight modification of one I used which
>> checks the consensus string from the input alignment (in aligned
>> FASTA format here), extracts the alignment slice using that match,
>> then spit the alignment out to STDOUT in clustalw format. This
>> should work for perl 5.8 and up, but it's only been tested on perl
>> 5.10. You should be able to use this to fit what you want.
>>
>> my $in = Bio::AlignIO->new(-file => $file,
>> -format => 'fasta');
>> my $out = Bio::AlignIO->new(-fh => \*STDOUT,
>> -format => 'clustalw');
>>
>> while (my $aln = $in->next_aln) {
>> my $c = $aln->consensus_string(100);
>> my @matches;
>> $c =~ m/
>> ([GTAC?]{18,21})
>> (?{my $match = check_match($1);
>> push @matches, [$match,
>> pos(),
>> length($match)]
>> if defined $match;})
>> (?!)
>> /xig;
>> for my $match (@matches) {
>> my ($hit, $st, $end) = ($match->[0],
>> $match->[1] - $match->[2] + 1,
>> $match->[1]);
>> my $newaln = $aln->slice($st, $end);
>> $out->write_aln($newaln);
>> }
>> }
>>
>> sub check_match {
>> my $match = shift;
>> return unless $match;
>> my $ct = $match =~ tr/?/?/;
>> return $match if $ct <= 4;
>> }
>>
>>
>> chris
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>--
>===========================================================
>: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>===========================================================
>
>
>
More information about the Bioperl-l
mailing list