[Bioperl-l] Suggestion for a new script: bp_repeat_mask_sequence.pl
Dan Bolser
dan.bolser at gmail.com
Mon Jul 12 09:34:47 UTC 2010
OK, here is my final version. I have tested it with GFF / Fasta, and
in theory it works with all installed BioPerl sequence / feature file
formats, however, I haven't tested any of those.
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use Bio::SeqIO;
use Bio::FeatureIO;
## Set options
my $verbose = 0;
my $feature_file;
my $sequence_file;
my $feature_format = 'GFF';
my $sequence_format = 'Fasta';
my $feature_to_mask = 'repeat_region';
my $seq_mask_character = 'X';
GetOptions
(
"verbose" => \$verbose,
"feature_file|f=s" => \$feature_file,
"sequence_file|s=s" => \$sequence_file,
"feature_format|ff=s" => \$feature_format,
"sequence_format|sf=s" => \$sequence_format,
"feature_to_mask|m=s" => \$feature_to_mask,
"seq_mask_character|c=s" => \$seq_mask_character,
)
or die "failed to parse command line options\n";
## Check options
## A value should be passed
die usage() unless
$sequence_file &&
$feature_file;
## The files should exist
die "problem with feature file '$feature_file' : $!\n"
unless -s $feature_file;
die "problem with sequence file '$sequence_file' : $!\n"
unless -s $sequence_file;
## The formats should be valid
$feature_format = lc($feature_format);
die "ERROR: feature format '$feature_format' is not supported!\n\n"
unless eval( "require Bio::FeatureIO::$feature_format" );
$sequence_format = lc($sequence_format);
die "ERROR: sequence format '$sequence_format' is not supported!\n\n"
unless eval( "require Bio::SeqIO::$sequence_format" );
## Erm...
die "1: what are you trying to do?\n"
unless $feature_to_mask;
die "2: what are you trying to do?\n"
unless length($seq_mask_character) == 1;
=head1 NAME
bp_repeat_mask_sequence.pl - mask sequence features
=head1 DESCRIPTION
Takes an input sequence file and a feature file, and returns the
sequence with 'repeat_region' features masked out (replaced with
X's). This is useful for downstream processing of the sequence file.
The masked sequence is written to STDOUT.
=head1 USAGE
bp_repeat_mask_sequence.pl <options>
Options:
-f
--feature_file The file from which the sequence features will
be read (for subsequent masking).
-s
--sequence_file The sequence file (to be masked).
--ff
--feature_format The format of the feature file
(the default is GFF).
--sf
--sequence_format The format of the sequence file
(the default is fasta).
-m
--feature_to_mask The type of feature to mask
(the default is 'repeat_region').
-c
--seq_mask_character The 'mask' character to use in the sequence.
(the default is 'X').
-v
--verbose Generate some debugging output
=cut
## Set up the BioPerl objects
my $gff_reader =
Bio::FeatureIO->new( -file => $feature_file,
-format => $feature_format
);
my $seq_reader =
Bio::SeqIO->new( -file => $sequence_file,
-format => $sequence_format,
);
my $seq_writer =
Bio::SeqIO->new( -fh => \*STDOUT,
-format => $sequence_format,
);
## Run
warn "hashing features to mask\n";
my (%repeats, $c);
while ( my $feature = $gff_reader->next_feature() ) {
if($verbose>0){
print
join("\t", #$feature,
$feature->seq_id,
$feature->type->name,
$feature->start,
$feature->end,
), "\n";
}
if($feature->type->name eq $feature_to_mask){
$c++;
push @{$repeats{ $feature->seq_id }},
[$feature->start,
$feature->end];
}
}
warn "read $c '$feature_to_mask' features for ",
scalar keys(%repeats), " sequences\n";
warn "masking sequences\n";
while(my $seq = $seq_reader->next_seq){
my $id = $seq->id;
my $sequence = $seq->seq;
print $id, "\n"
if $verbose > 0;
## Do the masking
for my $region (@{$repeats{ $id }}){
my ($start, $end) = @$region;
print "$start\t$end\n"
if $verbose > 1;
substr($sequence, $start, $end - $start,
$seq_mask_character x ($end - $start)
);
}
$seq->seq($sequence);
$seq_writer->write_seq($seq);
}
warn "done\n";
# A bit of a hack:
sub usage{
`perldoc -T ./$0`
}
On 7 July 2010 10:42, Brian Osborne <bosborne11 at verizon.net> wrote:
> Dan,
>
> In my opinion the user should be able to use any supported format as input, yes.
>
> Brian O.
>
> On Jul 7, 2010, at 1:14 AM, Dan Bolser wrote:
>
>> Cheers Brian!
>>
>> Do you think this script will work if I allow sequence and feature
>> '-format's to be picked by the user among all those listed as valid
>> file formats in BioPerl?
>>
>> http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats
>> http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/FeatureIO.pm#SUPPORTED_FORMATS
>>
>>
>> Or should I stick to Fasta/GFF (or some other implementation)?
>>
>> Cheers,
>> Dan.
>>
>>
>> On 6 July 2010 15:50, Brian Osborne <bosborne11 at verizon.net> wrote:
>>> Dan,
>>>
>>> There are 2 different directories for scripts, examples/ and scripts/. The examples/ directory accepts any sort of script. The scripts/ directory scripts can be installed when Bioperl is installed, if the user wishes. The guidelines are that scripts/ directory scripts should accept command-line arguments (which yours does), should be named with the suffix 'PLS', and should have POD documentation. So, all you need is some POD. Here's some example POD:
>>>
>>> =head1 NAME
>>>
>>> bioflat_index.pl - index sequence files using Bio::DB::Flat
>>>
>>> =head1 DESCRIPTION
>>>
>>> Create or update a biological sequence database indexed with the
>>> Bio::DB::Flat indexing scheme. The arguments are a list of flat files
>>> containing the sequence information to be indexed.
>>>
>>> =head1 USAGE
>>>
>>> bioflat_index.pl <options> file1 file2 file3...
>>>
>>> Options:
>>>
>>> --create Create or reinitialize the index. If not specified,
>>> the index must already exist.
>>>
>>> --format <format> The format of the sequence files. Must be one
>>> of "genbank", "swissprot", "embl" or "fasta".
>>>
>>> --location <path> Path to the directory in which the index files
>>> are stored.
>>>
>>> --dbname <name> The symbolic name of the database to be created.
>>>
>>> --indextype <type> Type of index to create. Either "bdb" or "flat".
>>> "binarysearch" is the same as "flat".
>>>
>>> Options can be abbreviated. For example, use -i for --indextype.
>>>
>>> The following environment variables will be used as defaults if the
>>> corresponding options are not provided:
>>>
>>> OBDA_FORMAT format of sequence file
>>> OBDA_LOCATION path to directory in which index files are stored
>>> OBDA_DBNAME name of database
>>> OBDA_INDEX type of index to create
>>>
>>> =cut
>>>
>>>
>>> On Jul 6, 2010, at 2:37 PM, Dan Bolser wrote:
>>>
>>>> Hello,
>>>>
>>>> I'd like to submit a script, 'bp_repeat_mask_sequence.pl'(?), to the
>>>> set of scripts in BioPerl. Below is what I have so far. The script
>>>> works by reading in a GFF of 'repeat_region's and a fasta file of
>>>> sequences. It outputs a fasta sequence file with the repeats replaced
>>>> by Xs.
>>>>
>>>> The script clearly needs to be more configurable, but I thought I'd
>>>> send it along now to see if I'm working along the right lines, or if I
>>>> should be using a different approach.
>>>>
>>>> Comments?
>>>>
>>>>
>>>> Cheers,
>>>> Dan.
>>>>
>>>>
>>>>
>>>> #!/usr/bin/perl -w
>>>>
>>>> use strict;
>>>> use Getopt::Long;
>>>>
>>>> use Bio::SeqIO;
>>>> use Bio::FeatureIO;
>>>>
>>>>
>>>>
>>>> ## Set options
>>>>
>>>> my $verbose = 0;
>>>> my $seq_file;
>>>> my $gff_file;
>>>>
>>>> GetOptions
>>>> (
>>>> "verbose" => \$verbose,
>>>> "seq=s" => \$seq_file,
>>>> "gff=s" => \$gff_file,
>>>> )
>>>> or die "failed to parse command line options\n";
>>>>
>>>> die "fail $gff_file : $!\n"
>>>> unless -s $gff_file;
>>>>
>>>>
>>>>
>>>> ## Set up the BioPerl objects
>>>>
>>>> my $seq_reader =
>>>> Bio::SeqIO->new( -file => $seq_file,
>>>> -format => 'fasta'
>>>> );
>>>>
>>>> my $seq_writer =
>>>> Bio::SeqIO->new( -fh => \*STDOUT,
>>>> -format => 'fasta',
>>>> -width => 80
>>>> );
>>>>
>>>> my $gff_reader =
>>>> Bio::FeatureIO->new( -file => $gff_file,
>>>> -format => 'GFF',
>>>> );
>>>>
>>>> #warn $seq_reader->width, "\n"; exit;
>>>>
>>>>
>>>>
>>>> ## Run
>>>>
>>>> my (%repeats, $c);
>>>>
>>>> while ( my $feature = $gff_reader->next_feature() ) {
>>>> if($verbose>1){
>>>> print
>>>> join("\t", #$feature,
>>>> $feature->seq_id,
>>>> $feature->type->name,
>>>> $feature->start,
>>>> $feature->end,
>>>> ), "\n";
>>>> }
>>>>
>>>> if($feature->type->name eq 'repeat_region'){
>>>> $c++;
>>>> push @{$repeats{ $feature->seq_id }},
>>>> [$feature->start,
>>>> $feature->end];
>>>> }
>>>>
>>>> # Debugging
>>>> #last if $c > 100;
>>>> }
>>>>
>>>> warn "read $c repeat_region features for ",
>>>> scalar keys(%repeats), " sequences\n";
>>>>
>>>>
>>>>
>>>> ##
>>>>
>>>> while(my $seq = $seq_reader->next_seq){
>>>> my $id = $seq->id;
>>>> my $sequence = $seq->seq;
>>>>
>>>> print $id, "\n"
>>>> if $verbose > 0;
>>>>
>>>> print length($sequence), "\n"
>>>> if $verbose > 0;
>>>>
>>>> for my $region (@{$repeats{ $id }}){
>>>> my ($start, $end) = @$region;
>>>> print "$start\t$end\n"
>>>> if $verbose > 1;
>>>>
>>>> substr($sequence, $start, $end - $start, 'X' x ($end - $start));
>>>> }
>>>>
>>>> print length($sequence), "\n"
>>>> if $verbose > 0;
>>>>
>>>> $seq->seq($sequence);
>>>>
>>>> $seq_writer->write_seq($seq);
>>>>
>>>> # Debugging;
>>>> #last;
>>>> }
>>>>
>>>> warn "OK\n";
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>
>
More information about the Bioperl-l
mailing list