Bioperl: repetitive DNA

Toshinori Endo tendo@rtc.riken.go.jp
Mon, 08 Nov 1999 14:59:44 +0900


Here is a program that tests the validity of the RE, followed by its output.
Although the RE does exactly what it expresses,
the output differs slightly from what Alessandro expected.
I wonder whether this RE is really OK, especially given the side effect of
greedy ("longest match") regular expression matching
that shows up between repeat numbers 2 and 3.


Toshi



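#!/usr/bin/perl
use strict;
use warnings;

# Test sequence from Alessandro's DUST example (the here-doc keeps the trailing newline).
my $sequence = <<EOF;
acgatgacgatgatatatatatatatacataatatatatcacaggggaatatatatatcccacataatata
EOF

print "org: ".$sequence;
for my $NREPEAT (1..7) {
	my $seq = $sequence;
	# \2{$NREPEAT,} requires the unit plus at least $NREPEAT further copies.
	$seq =~ s/((.+)\2{$NREPEAT,})/'N' x length $1/eg;
	print "  $NREPEAT: $seq";
}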

org: acgatgacgatgatatatatatatatacataatatatatcacaggggaatatatatatcccacataatata
  1: NNNNNNNNNNNNNNNNNNNNNNNNatacNNNNNNNNNNtNNNNNNNNNNNNNNNNNNtNNNacNNNNNNta
  2: acgatgacgatgNNNNNNNNNNNNatacataNNNNNNNNcacaNNNNaNNNNNNNNNNNNNacataatata
  3: acgatgacgatgNNNNNNNNNNNNNNacataNNNNNNNNcacaNNNNaNNNNNNNNNNcccacataatata
  4: acgatgacgatgNNNNNNNNNNNNNNacataatatatatcacaggggaNNNNNNNNNNcccacataatata
  5: acgatgacgatgNNNNNNNNNNNNNNacataatatatatcacaggggaatatatatatcccacataatata
  6: acgatgacgatgNNNNNNNNNNNNNNacataatatatatcacaggggaatatatatatcccacataatata
  7: acgatgacgatgatatatatatatatacataatatatatcacaggggaatatatatatcccacataatata
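
A possible reading of the difference between repeat numbers 2 and 3: a greedy (.+) tries the longest candidate unit first, so at repeat number 2 the engine settles on "atat" three times over and leaves the last "at" of the run unmasked. The sketch below is not part of the program above. It assumes a non-greedy (.+?) as one alternative, and the names mask_greedy and mask_lazy are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Greedy unit: (.+) tries the longest repeat unit first.
sub mask_greedy {
    my ($seq, $n) = @_;
    $seq =~ s/((.+)\2{$n,})/'N' x length $1/eg;
    return $seq;
}

# Non-greedy unit: (.+?) tries the shortest repeat unit first.
sub mask_lazy {
    my ($seq, $n) = @_;
    $seq =~ s/((.+?)\2{$n,})/'N' x length $1/eg;
    return $seq;
}

# A run of seven "at" units flanked by "gg" and "cc".
my $test = 'gg' . ('at' x 7) . 'cc';
print "org:    $test\n";
print "greedy: ", mask_greedy($test, 2), "\n";
print "lazy:   ", mask_lazy($test, 2), "\n";
```

On this input the greedy form masks 12 of the 14 bases in the run and leaves the final "at", while the non-greedy form masks the whole run. The non-greedy form is not necessarily better in general: it can settle for a very short unit (for example "aa") inside a longer repeat and end up masking less, so it is worth checking against real sequences.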



At 16:42 99/11/07 -0500, Lincoln Stein wrote:
> No module needed.  Here's a simple one-line regular expression that
> does everything that dust does.  It catches all repeats of unit length 
> 1 or greater that are repeated at least 4 times.
> 
>  $sequence =~ s/((.+)\2{4,})/'N' x length $1/eg;
> 
> This one occurred to me while writing problems for the CSHL genome
> informatics course.
> 
> Lincoln 
> 
> Alessandro Guffanti writes:
>  > Hi. I think a good solution could also be to use NCBI's DUST
>  > filter with a suitable cut-off, then retrieve the coordinates
>  > of masked sequences through a perl wrapper - c'est fait.
>  > You can retrieve DUST from WU ftp server:
>  > 
>  > ftp://blast.wustl.edu/pub/dust
>  > 
>  > >test
>  > acgatgacgatgatatatatatatatacataatatatatcacagggga
>  > atatatatatcccacataatata
>  > 
>  > dust test
>  > >test
>  > acgatgacgatgNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNcc
>  > cacataatata
>  > 
>  > dust test 45
>  > >test
>  > acgatgacgatgatatatatatatatacataatatatatcacaggggaatatatatatcc
>  > cacataatata
>  > 
>  > 
>  > Best Wishes,
>  > 
>  > Alessandro.
>  > 
>  > BTW, I think that this could be a good startup for a "filtering"
>  > module. Do you think this could be interesting ? It could be a
>  > method in a sequence object or a separate module per se. The outcome
>  > could be a list of coordinates in the sequence which correspond to
>  > masked areas. I would be happy to produce a rough version of this.
>  > 
>  > 
>  > -- 
>  > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  >        Alessandro Guffanti - Informatics      
>  > The Sanger Centre, Wellcome Trust Genome Campus
>  >   Hinxton, Cambridge CB10 1SA, United Kingdom        
>  >     phone: +1223-834244 * fax: +1223-494919
>  >       http://www.sanger.ac.uk/Users/ag3
>  > =========== Bioperl Project Mailing List Message Footer =======
>  > Project URL: http://bio.perl.org/
>  > For info about how to (un)subscribe, where messages are archived, etc:
>  > http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
>  > ====================================================================
> 
> -- 
> ========================================================================
> Lincoln D. Stein                           Cold Spring Harbor Laboratory
> lstein@cshl.org			                  Cold Spring Harbor, NY
> ========================================================================
> 
> 


---------------------------------------------------
Toshinori Endo, Ph.D.
RIKEN Genomic Sciences Center
Koyadai 3-1-1, Tsukuba, Ibaraki 305-0074, Japan
TEL 0298-36-9145  FAX 0298-36-9098 CP 090-4753-3206
Email tendo@rtc.riken.go.jp