[Bioperl-l] Suggestion for a new script: bp_repeat_mask_sequence.pl

Mon Jul 12 09:34:47 UTC 2010

OK, here is my final version. I have tested it with GFF / Fasta input.
In theory it should work with any sequence / feature file format that
the installed BioPerl supports, but I haven't tested any of the others.

#!/usr/bin/perl -w

use strict;
use Getopt::Long;

use Bio::SeqIO;
use Bio::FeatureIO;

## Set options

my $verbose = 0;

my $feature_file;
my $sequence_file;

my $feature_format  = 'GFF';
my $sequence_format = 'Fasta';

my $feature_to_mask = 'repeat_region';

my $seq_mask_character = 'X';

GetOptions
  (
   ## -v may be repeated (-v -v) to reach the '$verbose > 1' output
   "verbose|v+" => \$verbose,

   "feature_file|f=s"  => \$feature_file,
   "sequence_file|s=s" => \$sequence_file,

   "feature_format|ff=s"  => \$feature_format,
   "sequence_format|sf=s" => \$sequence_format,

   "feature_to_mask|m=s"  => \$feature_to_mask,

   "seq_mask_character|c=s" => \$seq_mask_character,
  )
  or die "failed to parse command line options\n";

## Check options

## A value should be passed
die usage() unless
  $sequence_file &&
  $feature_file;

## The files should exist
die "problem with feature file '$feature_file' : $!\n"
  unless -s $feature_file;
die "problem with sequence file '$sequence_file' : $!\n"
  unless -s $sequence_file;

## The formats should be valid
$feature_format = lc($feature_format);
die "ERROR: feature format '$feature_format' is not supported!\n\n"
  unless eval( "require Bio::FeatureIO::$feature_format" );

$sequence_format = lc($sequence_format);
die "ERROR: sequence format '$sequence_format' is not supported!\n\n"
  unless eval( "require Bio::SeqIO::$sequence_format" );

## The remaining options should be sensible
die "ERROR: no feature type to mask was given (see --feature_to_mask)\n"
  unless $feature_to_mask;

die "ERROR: the mask character must be exactly one character\n"
  unless length($seq_mask_character) == 1;

=head1 NAME

 bp_repeat_mask_sequence.pl - mask sequence features

=head1 DESCRIPTION

 Takes an input sequence file and a feature file, and returns the
 sequence with features of the chosen type ('repeat_region' by
 default) masked out (replaced with the mask character, 'X' by
 default). This is useful for downstream processing of the sequence
 file.

 The masked sequence is written to STDOUT.

=head1 USAGE

 bp_repeat_mask_sequence.pl <options>

 Options:

    -f
    --feature_file        The file from which the sequence features will
                          be read (for subsequent masking).

    -s
    --sequence_file       The sequence file (to be masked).

    --ff
    --feature_format      The format of the feature file
                          (the default is GFF).

    --sf
    --sequence_format     The format of the sequence file
                          (the default is fasta).

    -m
    --feature_to_mask     The type of feature to mask
                          (the default is 'repeat_region').

    -c
    --seq_mask_character  The 'mask' character to use in the sequence.
                          (the default is 'X').

    -v
    --verbose             Generate some debugging output

=cut

## Set up the BioPerl objects

my $gff_reader =
  Bio::FeatureIO->new( -file => $feature_file,
		       -format => $feature_format
		     );

my $seq_reader =
  Bio::SeqIO->new( -file => $sequence_file,
		   -format => $sequence_format,
		 );

my $seq_writer =
  Bio::SeqIO->new( -fh => \*STDOUT,
		   -format => $sequence_format,
		 );

## Run

warn "hashing features to mask\n";

my %repeats;
my $c = 0; ## Start at 0 so the summary below is clean if nothing matches

while ( my $feature = $gff_reader->next_feature() ) {
  if($verbose>0){
    ## Debugging output goes to STDERR, because the masked sequence
    ## is written to STDOUT
    print STDERR
      join("\t", #$feature,
	   $feature->seq_id,
	   $feature->type->name,
	   $feature->start,
	   $feature->end,
	  ), "\n";
  }

  if($feature->type->name eq $feature_to_mask){
    $c++;
    push @{$repeats{ $feature->seq_id }},
      [$feature->start,
       $feature->end];
  }
}

warn "read $c '$feature_to_mask' features for ",
  scalar keys(%repeats), " sequences\n";

warn "masking sequences\n";

while(my $seq = $seq_reader->next_seq){
  my $id = $seq->id;
  my $sequence = $seq->seq;

  ## Debugging output goes to STDERR (STDOUT carries the sequence)
  print STDERR $id, "\n"
    if $verbose > 0;

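  ## Do the masking. Feature coordinates are 1-based and inclusive,
  ## whereas substr() offsets are 0-based.
  for my $region (@{$repeats{ $id }}){
    my ($start, $end) = @$region;
    print STDERR "$start\t$end\n"
      if $verbose > 1;

    my $length = $end - $start + 1;
    substr($sequence, $start - 1, $length,
	   $seq_mask_character x $length
	  );
  }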

  $seq->seq($sequence);

  $seq_writer->write_seq($seq);
}

warn "done\n";

# A bit of a hack:
sub usage{
  `perldoc -T ./$0`
}
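
For anyone who wants to check the coordinate handling on its own, here
is a minimal sketch of the masking step without any BioPerl objects.
The mask_regions() helper is hypothetical (it is not part of the script
above), and it assumes the regions are 1-based, inclusive [start, end]
pairs, as in GFF. The swapping and clipping of coordinates are extra
safety checks that I have added for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: mask a list of
# [start, end] regions in a plain sequence string.
# Assumption: coordinates are 1-based and inclusive (as in GFF).
sub mask_regions {
    my ($sequence, $regions, $mask_char) = @_;
    my $length = length $sequence;

    for my $region (@$regions) {
        my ($start, $end) = @$region;

        # Tolerate regions given as (end, start)
        ($start, $end) = ($end, $start) if $start > $end;

        # Clip to the sequence so substr never runs off either end
        $start = 1       if $start < 1;
        $end   = $length if $end > $length;
        next if $start > $end;

        # substr offsets are 0-based, hence $start - 1
        my $span = $end - $start + 1;
        substr($sequence, $start - 1, $span, $mask_char x $span);
    }

    return $sequence;
}

# Mask positions 3..5 and 9..20 (the second region is clipped to 9..10)
print mask_regions('ACGTACGTAC', [ [3, 5], [9, 20] ], 'X'), "\n";
# prints "ACXXXCGTXX"
```

Overlapping regions need no special treatment here, since masking the
same position twice leaves it masked.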

On 7 July 2010 10:42, Brian Osborne <bosborne11 at verizon.net> wrote:
> Dan,
>
> In my opinion the user should be able to use any supported format as input, yes.
>
> Brian O.
>
> On Jul 7, 2010, at 1:14 AM, Dan Bolser wrote:
>
>> Cheers Brian!
>>
>> Do you think this script will work if I allow sequence and feature
>> '-format's to be picked by the user among all those listed as valid
>> file formats in BioPerl?
>>
>> http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats
>> http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/FeatureIO.pm#SUPPORTED_FORMATS
>>
>>
>> Or should I stick to Fasta/GFF (or some other implementation)?
>>
>> Cheers,
>> Dan.
>>
>>
>> On 6 July 2010 15:50, Brian Osborne <bosborne11 at verizon.net> wrote:
>>> Dan,
>>>
>>> There are 2 different directories for scripts, examples/ and scripts/. The examples/ directory accepts any sort of script. The scripts/ directory scripts can be installed when Bioperl is installed, if the user wishes. The guidelines are that scripts/ directory scripts should accept command-line arguments (which yours does), should be named with the suffix 'PLS', and should have POD documentation. So, all you need is some POD. Here's some example POD:
>>>
>>> =head1 NAME
>>>
>>> bioflat_index.pl - index sequence files using Bio::DB::Flat
>>>
>>> =head1 DESCRIPTION
>>>
>>> Create or update a biological sequence database indexed with the
>>> Bio::DB::Flat indexing scheme.  The arguments are a list of flat files
>>> containing the sequence information to be indexed.
>>>
>>> =head1 USAGE
>>>
>>> bioflat_index.pl <options> file1 file2 file3...
>>>
>>> Options:
>>>
>>>   --create              Create or reinitialize the index.  If not specified,
>>>                         the index must already exist.
>>>
>>>   --format   <format>   The format of the sequence files.  Must be one
>>>                         of "genbank", "swissprot", "embl" or "fasta".
>>>
>>>   --location <path>     Path to the directory in which the index files
>>>                         are stored.
>>>
>>>   --dbname <name>       The symbolic name of the database to be created.
>>>
>>>   --indextype <type>    Type of index to create.  Either "bdb" or "flat".
>>>                         "binarysearch" is the same as "flat".
>>>
>>> Options can be abbreviated.  For example, use -i for --indextype.
>>>
>>> The following environment variables will be used as defaults if the
>>> corresponding options are not provided:
>>>
>>>   OBDA_FORMAT      format of sequence file
>>>   OBDA_LOCATION    path to directory in which index files are stored
>>>   OBDA_DBNAME      name of database
>>>   OBDA_INDEX       type of index to create
>>>
>>> =cut
>>>
>>>
>>> On Jul 6, 2010, at 2:37 PM, Dan Bolser wrote:
>>>
>>>> Hello,
>>>>
>>>> I'd like to submit a script, 'bp_repeat_mask_sequence.pl'(?), to the
>>>> set of scripts in BioPerl. Below is what I have so far. The script
>>>> works by reading in a GFF of 'repeat_region's and a fasta file of
>>>> sequences. It outputs a fasta sequence file with the repeats replaced
>>>> by Xs.
>>>>
>>>> The script clearly needs to be more configurable, but I thought I'd
>>>> send it along now to see if I'm working along the right lines, or if I
>>>> should be using a different approach.
>>>>
>>>> Comments?
>>>>
>>>>
>>>> Cheers,
>>>> Dan.
>>>>
>>>>
>>>>
>>>> #!/usr/bin/perl -w
>>>>
>>>> use strict;
>>>> use Getopt::Long;
>>>>
>>>> use Bio::SeqIO;
>>>> use Bio::FeatureIO;
>>>>
>>>>
>>>>
>>>> ## Set options
>>>>
>>>> my $verbose = 0;
>>>> my $seq_file;
>>>> my $gff_file;
>>>>
>>>> GetOptions
>>>> (
>>>> "verbose" => \$verbose,
>>>> "seq=s" => \$seq_file,
>>>> "gff=s" => \$gff_file,
>>>> )
>>>> or die "failed to parse command line options\n";
>>>>
>>>> die "fail $gff_file : $!\n"
>>>> unless -s $gff_file;
>>>>
>>>>
>>>>
>>>> ## Set up the BioPerl objects
>>>>
>>>> my $seq_reader =
>>>> Bio::SeqIO->new( -file => $seq_file,
>>>>                -format => 'fasta'
>>>>              );
>>>>
>>>> my $seq_writer =
>>>> Bio::SeqIO->new( -fh => \*STDOUT,
>>>>                -format => 'fasta',
>>>>                -width => 80
>>>>              );
>>>>
>>>> my $gff_reader =
>>>> Bio::FeatureIO->new( -file => $gff_file,
>>>>                    -format => 'GFF',
>>>>                  );
>>>>
>>>> #warn $seq_reader->width, "\n"; exit;
>>>>
>>>>
>>>>
>>>> ## Run
>>>>
>>>> my (%repeats, $c);
>>>>
>>>> while ( my $feature = $gff_reader->next_feature() ) {
>>>> if($verbose>1){
>>>>  print
>>>>    join("\t", #$feature,
>>>>        $feature->seq_id,
>>>>        $feature->type->name,
>>>>        $feature->start,
>>>>        $feature->end,
>>>>       ), "\n";
>>>> }
>>>>
>>>> if($feature->type->name eq 'repeat_region'){
>>>>  $c++;
>>>>  push @{$repeats{ $feature->seq_id }},
>>>>    [$feature->start,
>>>>     $feature->end];
>>>> }
>>>>
>>>> # Debugging
>>>> #last if $c > 100;
>>>> }
>>>>
>>>> warn "read $c repeat_region features for ",
>>>> scalar keys(%repeats), " sequences\n";
>>>>
>>>>
>>>>
>>>> ##
>>>>
>>>> while(my $seq = $seq_reader->next_seq){
>>>> my $id = $seq->id;
>>>> my $sequence = $seq->seq;
>>>>
>>>> print $id, "\n"
>>>>  if $verbose > 0;
>>>>
>>>> print length($sequence), "\n"
>>>>  if $verbose > 0;
>>>>
>>>> for my $region (@{$repeats{ $id }}){
>>>>  my ($start, $end) = @$region;
>>>>  print "$start\t$end\n"
>>>>    if $verbose > 1;
>>>>
>>>>  substr($sequence, $start, $end - $start, 'X' x ($end - $start));
>>>> }
>>>>
>>>> print length($sequence), "\n"
>>>>  if $verbose > 0;
>>>>
>>>> $seq->seq($sequence);
>>>>
>>>> $seq_writer->write_seq($seq);
>>>>
>>>> # Debugging;
>>>> #last;
>>>> }
>>>>
>>>> warn "OK\n";
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>
>