Bioperl: expert at reg. expressions: some patterns, thanks

James Freeman jfreeman@darwin.bu.edu
Thu, 8 Oct 1998 18:32:23 -0400 (EDT)


> Thanks to everyone for the TREMENDOUS response I got after posting the
> following message.  
> 
> >Could I solicit the expertise of anyone highly (and creatively) skilled  in
> >constructing regular expressions?  I have some patterns that I can't solve
> >the regular expressions for and I could use some good ideas....
> >dawn
> 
> I've had so many offers for help from generous or curious people looking
> for 'puzzles' and it's been requested that I post some patterns to the list
> to see how different people provide a solution. I hope no one minds that I
> posted these to the list, and special thanks to Andrew Dalke and Gustavo
> Glusman for the solutions I have gotten so far...
> 
> I study repetitive DNA so I'm very interested in patterns.  I've written
> programs to look for these patterns before, but not in Perl, and I'm just
> learning the power of regular expressions.
> 
> 
> so for example, I need to match:
> 1.  pattern:
> >How do I find QAQAQAQAQAQA in a protein sequence?  It's like finding an
> >iteration of "QA", but can I make a regular expression that doesn't need
> >a motif like "QA" specified?
> 
> offered solution
> 
> Try  /(..)\1{2,}/  or  /(..)\1+/
> 
> $1 will tell you what the dipeptide was. length($&)/2 will tell you the
> number of copies.

Also try:

/(.)(.)(\1\2){2,}/

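with the same length formula.  Here the dipeptide is $1.$2 and at least
three copies are required.  This is probably inferior to the regular
expressions above, since it needs three capture groups to do the same job.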
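
As a rough sketch of how the dipeptide idea could be run over a whole
sequence, the loop below reports every tandem repeat instead of only the
first one.  The sub name find_dipeptide_repeats and the test sequence are
made up for illustration.  The backreference is written \2 because, inside
a pattern, a backreference is \N rather than $N, and an outer capture group
stands in for $&.

```perl
use strict;
use warnings;

# Hypothetical helper: return [unit, copies, start offset] for every tandem
# repeat of a two-residue unit (two or more copies) in $seq.
sub find_dipeptide_repeats {
    my ($seq) = @_;
    my @hits;
    # Group 1 captures the whole repeat; \2 refers back to the unit.
    while ($seq =~ /((..)\2+)/g) {
        my ($whole, $unit) = ($1, $2);
        push @hits, [ $unit, length($whole) / 2, pos($seq) - length($whole) ];
    }
    return @hits;
}

# Made-up test sequence.
for my $hit (find_dipeptide_repeats("MKQAQAQAQAQAQALLGSGSGSW")) {
    printf "%s x %d at offset %d\n", @$hit;
}
```

Note that a unit such as "QQ" also counts as a dipeptide here, so runs of a
single residue are reported too.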

> 
> 2.  pattern:
> 
> I understand (R|H){6,} finds all combinations of tracts of R and H of
> lengths 6 or greater.  But if I want only "combination" tracts that are
> made of a combination of BOTH R and H, how do I write an RE to exclude
> tracts of ONLY R (R)n and ONLY H (H)n?

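In your if statement put the following, which finds a tract and then
checks that the matched text ($&) has an R next to an H somewhere:

# Only the first tract of 6 or more is tested.
if( $foo =~ /[RH]{6,}/ && $& =~ /RH|HR/) {
}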
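
Because the if statement only looks at the first tract the pattern finds, a
pure run of R early in the sequence can hide a mixed tract further along.
A minimal sketch of one way around that, assuming you want every mixed
tract, is to walk the sequence with //g.  The sub name mixed_rh_tracts and
the test sequence are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return every tract of R and H, at least $min long,
# that contains both residues.
sub mixed_rh_tracts {
    my ($seq, $min) = @_;
    my @tracts;
    while ($seq =~ /([RH]{$min,})/g) {
        my $tract = $1;
        # A tract holding both residues must have an R next to an H somewhere.
        push @tracts, $tract if $tract =~ /RH|HR/;
    }
    return @tracts;
}

# Made-up test sequence: a pure R run, a mixed tract, then a pure H run.
print "$_\n" for mixed_rh_tracts("AARRRRRRRGGHRHHRRHKKHHHHHHHHM", 6);
```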

> 
> 3. can I find a tract of Q (of minimum length N) followed by no more than X
> amino acids before another tract of Q (of minimum length N) is found again?
>  For example, to find:
> 
> AGTWRWDFDQQQQQQQQFAFCRCFCFAFAFCRFQQQQQQQQQQQQQ

# Here N = 5 and X = 16; the gap is 1 to 16 residues and may not contain a Q.
if($foo =~ /Q{5,}[^Q]{1,16}Q{5,}/) {
}
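
If N and X need to vary, the same pattern can be built from variables.
This is only a sketch: the sub name paired_q_tracts is hypothetical, and it
assumes the gap holds at least one residue and no Q at all.

```perl
use strict;
use warnings;

# Hypothetical helper: find two Q tracts (each at least $n long) separated
# by 1 to $x non-Q residues.  In list context it returns
# (first tract, gap, second tract), or an empty list when nothing matches.
sub paired_q_tracts {
    my ($seq, $n, $x) = @_;
    return $seq =~ /(Q{$n,})([^Q]{1,$x})(Q{$n,})/;
}

# The example sequence from the question, with N = 5 and X = 16.
my @parts = paired_q_tracts(
    "AGTWRWDFDQQQQQQQQFAFCRCFCFAFAFCRFQQQQQQQQQQQQQ", 5, 16);
if (@parts) {
    printf "tract 1: %d Q, gap: %d residues, tract 2: %d Q\n",
        map { length($_) } @parts;
}
```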
> 
> 4. how do I find tracts of an identical amino acid that are flanked at
> either end with the same amino acid...
> Good at: HTTTTTTTTTTH or  TGGGGGGGGGGGT
> 

# (?!\1) rejects a tract residue that is the same as the flanking residue.
if($foo =~ /(.)(?!\1)(.)\2+\1/) {
} 
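
To list every flanked tract in a sequence, one way to do it in a single
pattern is a negative lookahead, (?!\1), so the tract residue can never be
the same as the flank.  The sketch below is an assumption-laden example,
not a tested tool: flanked_tracts and its minimum-length argument are
invented here, and the closing flank is checked with a lookahead so that
it can also open the next tract.

```perl
use strict;
use warnings;

# Hypothetical helper: return [flank, tract residue, tract length] for every
# run of one residue (at least $min long, $min >= 1) bounded on both sides
# by the same, different residue.
sub flanked_tracts {
    my ($seq, $min) = @_;
    my $more = $min - 1;    # copies required after the first tract residue
    my @hits;
    # Group 1: flank.  Group 2: whole tract.  Group 3: tract residue.
    # (?=\1) checks the closing flank without consuming it.
    while ($seq =~ /(.)(?!\1)((.)\3{$more,})(?=\1)/g) {
        push @hits, [ $1, $3, length($2) ];
    }
    return @hits;
}

# The two examples from the question.
for my $seq ("HTTTTTTTTTTH", "TGGGGGGGGGGGT") {
    for my $hit (flanked_tracts($seq, 3)) {
        printf "%s tract of length %d flanked by %s\n",
            $hit->[1], $hit->[2], $hit->[0];
    }
}
```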

I hope this helps,


Jim Freeman

> 
>  
> Dawn
> 
> ** ** ** ** ** ** ** ** ** ** ** ** ** **
> ***************************************
> Dawn Field
> University of California, San Diego
> Department of Biology
> Rm #3165, Muir Biology
> 9500 Gilman Drive
> La Jolla, CA 92093-0116
> 
> e-mail dfield@ucsd.edu
> Tel  (619) 534-5474
> Fax  (619) 534-7108
> ***************************************
> ** ** ** ** ** ** ** ** ** ** ** ** ** **
> 
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://bio.perl.org/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================
> 


-- 
Jim Freeman  P: mammon@tiac.net W: jfreeman@darwin.bu.edu
Programmer/Analyst at Bio-Molecular Engineering Center at BU.
Enjoy yourself, it's later than you think.
http://www.tiac.net/users/mammon/index.html
