[Bioperl-l] Finding possible primers regex

Mon Aug 11 22:35:13 EDT 2008

Actually, now that you ask, I'm wondering whether one wouldn't
sometimes want to retain the relationship between the match and the
resulting sliced alignment. If so, neither an AlignIO stream nor an
array would accomplish that, right?
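
One way to keep that relationship would be to return records that carry both the match coordinates and the slice. This is only a minimal sketch in plain Perl, not a proposed SimpleAlign API: the sub name `slices_with_matches`, the hash keys, and the array of strings standing in for a SimpleAlign object are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Pair each accepted window with the slice it produces, so the caller can
# always trace a sub-alignment back to the match that generated it.
# $rows is an array ref of equal-length aligned strings (a stand-in for a
# SimpleAlign object); $windows is an array ref of [0-based start, length].
sub slices_with_matches {
    my ($rows, $consensus, $windows) = @_;
    my @results;
    for my $w (@$windows) {
        my ($offset, $len) = @$w;
        push @results, {
            start => $offset + 1,       # 1-based, inclusive
            end   => $offset + $len,    # 1-based, inclusive
            match => substr($consensus, $offset, $len),
            slice => [ map { substr($_, $offset, $len) } @$rows ],
        };
    }
    return @results;
}

my @rows = qw(TTAGCCTAA TTAGCAGAA TTACCCTAA);
for my $r (slices_with_matches(\@rows, 'TTA?C??AA', [ [0, 4], [5, 4] ])) {
    print "$r->{start}-$r->{end} $r->{match}: @{ $r->{slice} }\n";
}
```

With real alignments the `slice` entry would presumably hold the object returned by `$aln->slice($start, $end)` rather than plain strings.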

Other than that, I don't have a strong preference either way. I
suppose an AlignIO stream is somewhat more extensible since, as you
say, it could be coupled to a file if the resulting set of alignments
is really large.
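
To illustrate the stream-style interface in general terms, here is a small sketch of an iterator that hands back one candidate window per call instead of building the whole list first. It is plain Perl and does not use AlignIO; the name `window_stream` and the 4-5 character window size are hypothetical choices that keep the demo short.

```perl
use strict;
use warnings;

# Return a closure that yields one [0-based start, window] pair per call and
# undef when exhausted, so only one window is held in memory at a time.
sub window_stream {
    my ($text, $min, $max) = @_;
    my ($start, $len) = (0, $min);
    return sub {
        while ($start + $min <= length($text)) {
            if ($len <= $max && $start + $len <= length($text)) {
                my $hit = [ $start, substr($text, $start, $len) ];
                $len++;
                return $hit;
            }
            ($start, $len) = ($start + 1, $min);   # move to the next start offset
        }
        return;   # exhausted
    };
}

my $next = window_stream('TTA?C??AA', 4, 5);
while (my $hit = $next->()) {
    print "$hit->[0] $hit->[1]\n";
}
```

The caller's loop has the same shape as `while (my $aln = $in->next_aln)`, which is what makes a stream easy to back with a file later on.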

	-hilmar

On Aug 11, 2008, at 3:50 PM, Christopher Fields wrote:

> When I get a chance, I could try writing a method which accepts a
> regex/Bio::Tools::SeqPattern and returns an AlignIO stream or an array
> of SimpleAlign instances (the former could be attached to a temp file
> for iteration).  Any preference?
>
> chris
>
> ---- Original message ----
>> Date: Sat, 9 Aug 2008 12:07:30 -0400
>> From: Hilmar Lapp <hlapp at gmx.net>
>> Subject: Re: [Bioperl-l] Finding possible primers regex
>> To: Chris Fields <cjfields at uiuc.edu>
>> Cc: Benbo <btemperton at googlemail.com>, Bioperl-l at lists.open-bio.org
>>
>> This looks like a neat trick. Do you think it's worth including as a
>> SimpleAlign method (obviously w/o the printing to STDOUT)? I can
>> imagine that a lot of people might appreciate it.
>>
>> 	-hilmar
>>
>> On Aug 4, 2008, at 12:08 AM, Chris Fields wrote:
>>
>>> On Aug 2, 2008, at 3:05 PM, Benbo wrote:
>>>
>>>>
>>>> Hi there,
>>>> I'm trying to write a perl script to scan an aligned multiple entry fasta
>>>> file and find possible primers. So far I've produced a string which contains
>>>> bases which match all sequences and * where they don't match e.g.
>>>> 1) TTAGCCTAA
>>>> 2) TTAGCAGAA
>>>> 3) TTACCCTAA
>>>>
>>>> would give TTA*C**AA.
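
A minimal sketch of how such a string could be built, assuming the aligned sequences are already in memory as equal-length strings. The sub name `consensus_mask` is hypothetical and not part of BioPerl.

```perl
use strict;
use warnings;

# Build a mask from equal-length aligned sequences: keep a base where every
# sequence agrees, and write '*' where any sequence differs.
sub consensus_mask {
    my @seqs = @_;
    return '' unless @seqs;
    my $mask = '';
    for my $i (0 .. length($seqs[0]) - 1) {
        my $base = substr($seqs[0], $i, 1);
        my $same = !grep { substr($_, $i, 1) ne $base } @seqs;
        $mask .= $same ? $base : '*';
    }
    return $mask;
}

print consensus_mask(qw(TTAGCCTAA TTAGCAGAA TTACCCTAA)), "\n";   # TTA*C**AA
```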
>>>>
>>>> I want to parse this string and pull out all sequences which are 18-21 bp in
>>>> length and have no more than 4 * in them.
>>>>
>>>> So far, I've got this:
>>>>
>>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>>> print "$1\n";
>>>> }
>>>>
>>>> hoping to match all fragments 18-21 characters in length. However, even that
>>>> doesn't work, as it has essentially chunked the string into 21 char blocks,
>>>> rather than what I hoped for, which was
>>>> 0-18
>>>> 0-19
>>>> 0-20
>>>> 0-21
>>>> 1-19
>>>> 1-20
>>>> 1-21
>>>> 1-22
>>>>
>>>> etc.
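
The chunking happens because, with /g, each search resumes where the previous match ended, and the greedy {18,21} takes 21 characters whenever it can. A plain-Perl sketch that instead enumerates every start offset and every length explicitly is shown here; the sub name `candidate_windows` is hypothetical, and the demo uses small numbers rather than 18-21 so the result is easy to follow.

```perl
use strict;
use warnings;

# Return every window of $min..$max characters, at every start offset, that
# contains at most $max_gaps '*' characters. Offsets are 0-based.
sub candidate_windows {
    my ($mask, $min, $max, $max_gaps) = @_;
    my @hits;
    for my $start (0 .. length($mask) - $min) {
        for my $len ($min .. $max) {
            last if $start + $len > length($mask);
            my $window = substr($mask, $start, $len);
            my $gaps   = ($window =~ tr/*//);   # count '*' without changing the string
            push @hits, [ $start, $len, $window ] if $gaps <= $max_gaps;
        }
    }
    return @hits;
}

# The question's numbers would be (18, 21, 4); small values keep the demo readable.
for my $hit (candidate_windows('TTA*C**AA', 4, 5, 1)) {
    printf "%d-%d %s\n", $hit->[0], $hit->[0] + $hit->[1], $hit->[2];
}
```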
>>>>
>>>> Can anyone let me know if this is already possible in BioPerl, or how one
>>>> would go about it with regex? Sadly I'm fairly new to perl and getting to
>>>> grips with BioPerl, so please treat me gently :).
>>>>
>>>> Many thanks,
>>>>
>>>> Ben
>>>
>>> There is a trick to this which is discussed more extensively in
>>> 'Mastering Regular Expressions'.  Essentially you have to embed code
>>> into the regex and force the regex engine to backtrack using an empty
>>> negative lookahead, (?!), which always fails.  The match itself fails
>>> (i.e. no match is returned), but the embedded code is executed for
>>> each match attempt.
>>>
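
A stripped-down sketch of the same trick on a plain string, without BioPerl, may make the mechanics easier to see. The sub name `all_windows`, the package variable `@seen`, and the short {4,5} window are assumptions made for illustration; a filter such as the '?' count could be added inside the code block.

```perl
use strict;
use warnings;

# Package variable rather than a lexical: closures over "my" variables inside
# (?{ }) blocks were unreliable before perl 5.18.
our @seen;

# Record every 4-5 character window the engine visits while backtracking.
# (?!) always fails, so the overall match never succeeds; the (?{ }) block
# is what records each attempt.
sub all_windows {
    my ($text) = @_;
    local @seen = ();
    $text =~ m/
        ([GTAC*]{4,5})                                  # candidate window, longest first
        (?{ push @seen, [ pos() - length($1), $1 ] })   # record 0-based start and text
        (?!)                                            # force failure so the engine backtracks
    /x;
    return @seen;
}

for my $hit (all_windows('TTA*C**AA')) {
    print "$hit->[0] $hit->[1]\n";
}
```

Because a failing match already tries every start position before giving up, this sketch does not need the /g modifier.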
>>> The following script is a slight modification of one I used.  It
>>> checks the consensus string from the input alignment (in aligned
>>> FASTA format here), extracts the alignment slice for each match,
>>> then writes the slice out to STDOUT in clustalw format.  This
>>> should work for perl 5.8 and up, but it's only been tested on perl
>>> 5.10.  You should be able to adapt this to fit what you want.
>>>
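>>> use strict;
>>> use warnings;
>>> use Bio::AlignIO;
>>>
>>> my $file = shift or die "Usage: $0 aligned.fasta\n";
>>> my $in = Bio::AlignIO->new(-file => $file,
>>>                          -format => 'fasta');
>>> my $out = Bio::AlignIO->new(-fh => \*STDOUT,
>>>                          -format => 'clustalw');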
>>>
>>> while (my $aln = $in->next_aln) {
>>>   my $c = $aln->consensus_string(100);
>>>   my @matches;
>>>   $c =~ m/
>>>       ([GTAC?]{18,21})
>>>       (?{my $match = check_match($1);
>>>          push @matches, [$match,
>>>                          pos(),
>>>                          length($match)]
>>>             if defined $match;})
>>>       (?!)
>>>       /xig;
>>>   for my $match (@matches) {
>>>       my ($hit, $st, $end) = ($match->[0],
>>>                               $match->[1] - $match->[2] + 1,
>>>                               $match->[1]);
>>>       my $newaln = $aln->slice($st, $end);
>>>       $out->write_aln($newaln);
>>>   }
>>> }
>>>
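>>> sub check_match {
>>>   my $match = shift;
>>>   return unless $match;
>>>   my $ct = $match =~ tr/?/?/;   # count '?' (non-consensus) positions
>>>   return $match if $ct <= 4;
>>>   return;   # explicit undef; otherwise the false test value is returned
>>> }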
>>>
>>>
>>> chris
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> -- 
>> ===========================================================
>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> ===========================================================