[Bioperl-l] Suggestion for a new script: bp_repeat_mask_sequence.pl

Tue Jul 6 14:50:25 UTC 2010

Dan,

There are 2 different directories for scripts, examples/ and scripts/. The examples/ directory accepts any sort of script. The scripts/ directory scripts can be installed when Bioperl is installed, if the user wishes. The guidelines are that scripts/ directory scripts should accept command-line arguments (which yours does), should be named with the suffix 'PLS', and should have POD documentation. So, all you need is some POD. Here's some example POD:

=head1 NAME

bioflat_index.pl - index sequence files using Bio::DB::Flat

=head1 DESCRIPTION

 Create or update a biological sequence database indexed with the
 Bio::DB::Flat indexing scheme.  The arguments are a list of flat files
 containing the sequence information to be indexed.

=head1 USAGE

 bioflat_index.pl <options> file1 file2 file3...

 Options:

     --create              Create or reinitialize the index.  If not specified,
                           the index must already exist.

     --format   <format>   The format of the sequence files.  Must be one
                           of "genbank", "swissprot", "embl" or "fasta".

     --location <path>     Path to the directory in which the index files
                           are stored.

     --dbname <name>       The symbolic name of the database to be created.

     --indextype <type>    Type of index to create.  Either "bdb" or "flat".
                           "binarysearch" is the same as "flat".

Options can be abbreviated.  For example, use -i for --indextype.

The following environment variables will be used as defaults if the 
corresponding options are not provided:

     OBDA_FORMAT      format of sequence file
     OBDA_LOCATION    path to directory in which index files are stored
     OBDA_DBNAME      name of database
     OBDA_INDEX       type of index to create

=cut

On Jul 6, 2010, at 2:37 PM, Dan Bolser wrote:

> Hello,
> 
> I'd like to submit a script, 'bp_repeat_mask_sequence.pl'(?), to the
> set of scripts in BioPerl. Below is what I have so far. The script
> works by reading in a GFF of 'repeat_region's and a fasta file of
> sequences. It outputs a fasta sequence file with the repeats replaced
> by Xs.
> 
> The script clearly needs to be more configurable, but I thought I'd
> send it along now to see if I'm working along the right lines, or if I
> should be using a different approach.
> 
> Comments?
> 
> 
> Cheers,
> Dan.
> 
> 
> 
> #!/usr/bin/perl -w
> 
> use strict;
> use Getopt::Long;
> 
> use Bio::SeqIO;
> use Bio::FeatureIO;
> 
> 
> 
> ## Set options
> 
> my $verbose = 0;
> my $seq_file;
> my $gff_file;
> 
> GetOptions
>  (
>   "verbose" => \$verbose,
>   "seq=s" => \$seq_file,
>   "gff=s" => \$gff_file,
>  )
>  or die "failed to parse command line options\n";
> 
> die "fail $gff_file : $!\n"
>  unless -s $gff_file;
> 
> 
> 
> ## Set up the BioPerl objects
> 
> my $seq_reader =
>  Bio::SeqIO->new( -file => $seq_file,
> 		   -format => 'fasta'
> 		 );
> 
> my $seq_writer =
>  Bio::SeqIO->new( -fh => \*STDOUT,
> 		   -format => 'fasta',
> 		   -width => 80
> 		 );
> 
> my $gff_reader =
>  Bio::FeatureIO->new( -file => $gff_file,
> 		       -format => 'GFF',
> 		     );
> 
> #warn $seq_reader->width, "\n"; exit;
> 
> 
> 
> ## Run
> 
> my (%repeats, $c);
> 
> while ( my $feature = $gff_reader->next_feature() ) {
>  if($verbose>1){
>    print
>      join("\t", #$feature,
> 	   $feature->seq_id,
> 	   $feature->type->name,
> 	   $feature->start,
> 	   $feature->end,
> 	  ), "\n";
>  }
> 
>  if($feature->type->name eq 'repeat_region'){
>    $c++;
>    push @{$repeats{ $feature->seq_id }},
>      [$feature->start,
>       $feature->end];
>  }
> 
>  # Debugging
>  #last if $c > 100;
> }
> 
> warn "read $c repeat_region features for ",
>  scalar keys(%repeats), " sequences\n";
> 
> 
> 
> ##
> 
> while(my $seq = $seq_reader->next_seq){
>  my $id = $seq->id;
>  my $sequence = $seq->seq;
> 
>  print $id, "\n"
>    if $verbose > 0;
> 
>  print length($sequence), "\n"
>    if $verbose > 0;
> 
>  for my $region (@{$repeats{ $id }}){
>    my ($start, $end) = @$region;
>    print "$start\t$end\n"
>      if $verbose > 1;
> 
>    substr($sequence, $start, $end - $start, 'X' x ($end - $start));
>  }
> 
>  print length($sequence), "\n"
>    if $verbose > 0;
> 
>  $seq->seq($sequence);
> 
>  $seq_writer->write_seq($seq);
> 
>  # Debugging;
>  #last;
> }
> 
> warn "OK\n";
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l