[Bioperl-l] Suggestion for a new script: bp_repeat_mask_sequence.pl

Dan Bolser dan.bolser at gmail.com
Tue Jul 6 12:37:05 UTC 2010


Hello,

I'd like to submit a script, 'bp_repeat_mask_sequence.pl'(?), to the
set of scripts in BioPerl. Below is what I have so far. The script
works by reading in a GFF of 'repeat_region's and a fasta file of
sequences. It outputs a fasta sequence file with the repeats replaced
by Xs.

The script clearly needs to be more configurable, but I thought I'd
send it along now to see if I'm working along the right lines, or if I
should be using a different approach.

Comments?


Cheers,
Dan.



#!/usr/bin/perl -w

use strict;
use Getopt::Long;

use Bio::SeqIO;
use Bio::FeatureIO;



## Set options

my $verbose = 0;
my $seq_file;
my $gff_file;

GetOptions
  (
   "verbose" => \$verbose,
   "seq=s" => \$seq_file,
   "gff=s" => \$gff_file,
  )
  or die "failed to parse command line options\n";

die "fail $gff_file : $!\n"
  unless -s $gff_file;



## Set up the BioPerl objects

my $seq_reader =
  Bio::SeqIO->new( -file => $seq_file,
		   -format => 'fasta'
		 );

my $seq_writer =
  Bio::SeqIO->new( -fh => \*STDOUT,
		   -format => 'fasta',
		   -width => 80
		 );

my $gff_reader =
  Bio::FeatureIO->new( -file => $gff_file,
		       -format => 'GFF',
		     );

#warn $seq_reader->width, "\n"; exit;



## Run

my (%repeats, $c);

while ( my $feature = $gff_reader->next_feature() ) {
  if($verbose>1){
    print
      join("\t", #$feature,
	   $feature->seq_id,
	   $feature->type->name,
	   $feature->start,
	   $feature->end,
	  ), "\n";
  }

  if($feature->type->name eq 'repeat_region'){
    $c++;
    push @{$repeats{ $feature->seq_id }},
      [$feature->start,
       $feature->end];
  }

  # Debugging
  #last if $c > 100;
}

warn "read $c repeat_region features for ",
  scalar keys(%repeats), " sequences\n";



##

while(my $seq = $seq_reader->next_seq){
  my $id = $seq->id;
  my $sequence = $seq->seq;

  print $id, "\n"
    if $verbose > 0;

  print length($sequence), "\n"
    if $verbose > 0;

  for my $region (@{$repeats{ $id }}){
    my ($start, $end) = @$region;
    print "$start\t$end\n"
      if $verbose > 1;

    substr($sequence, $start, $end - $start, 'X' x ($end - $start));
  }

  print length($sequence), "\n"
    if $verbose > 0;

  $seq->seq($sequence);

  $seq_writer->write_seq($seq);

  # Debugging;
  #last;
}

warn "OK\n";



More information about the Bioperl-l mailing list