[Bioperl-l] Suggestion for a new script: bp_repeat_mask_sequence.pl

Wed Jul 7 09:42:59 UTC 2010

Dan,

In my opinion the user should be able to use any supported format as input, yes.

Brian O.

On Jul 7, 2010, at 1:14 AM, Dan Bolser wrote:

> Cheers Brian!
> 
> Do you think this script will work if I allow sequence and feature
> '-format's to be picked by the user among all those listed as valid
> file formats in BioPerl?
> 
> http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats
> http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/FeatureIO.pm#SUPPORTED_FORMATS
> 
> 
> Or should I stick to Fasta/GFF (or some other implementation)?
> 
> Cheers,
> Dan.
> 
> 
> On 6 July 2010 15:50, Brian Osborne <bosborne11 at verizon.net> wrote:
>> Dan,
>> 
>> There are 2 different directories for scripts, examples/ and scripts/. The examples/ directory accepts any sort of script. The scripts/ directory scripts can be installed when Bioperl is installed, if the user wishes. The guidelines are that scripts/ directory scripts should accept command-line arguments (which yours does), should be named with the suffix 'PLS', and should have POD documentation. So, all you need is some POD. Here's some example POD:
>> 
>> =head1 NAME
>> 
>> bioflat_index.pl - index sequence files using Bio::DB::Flat
>> 
>> =head1 DESCRIPTION
>> 
>> Create or update a biological sequence database indexed with the
>> Bio::DB::Flat indexing scheme.  The arguments are a list of flat files
>> containing the sequence information to be indexed.
>> 
>> =head1 USAGE
>> 
>> bioflat_index.pl <options> file1 file2 file3...
>> 
>> Options:
>> 
>>   --create              Create or reinitialize the index.  If not specified,
>>                         the index must already exist.
>> 
>>   --format   <format>   The format of the sequence files.  Must be one
>>                         of "genbank", "swissprot", "embl" or "fasta".
>> 
>>   --location <path>     Path to the directory in which the index files
>>                         are stored.
>> 
>>   --dbname <name>       The symbolic name of the database to be created.
>> 
>>   --indextype <type>    Type of index to create.  Either "bdb" or "flat".
>>                         "binarysearch" is the same as "flat".
>> 
>> Options can be abbreviated.  For example, use -i for --indextype.
>> 
>> The following environment variables will be used as defaults if the
>> corresponding options are not provided:
>> 
>>   OBDA_FORMAT      format of sequence file
>>   OBDA_LOCATION    path to directory in which index files are stored
>>   OBDA_DBNAME      name of database
>>   OBDA_INDEX       type of index to create
>> 
>> =cut
>> 
>> 
>> On Jul 6, 2010, at 2:37 PM, Dan Bolser wrote:
>> 
>>> Hello,
>>> 
>>> I'd like to submit a script, 'bp_repeat_mask_sequence.pl'(?), to the
>>> set of scripts in BioPerl. Below is what I have so far. The script
>>> works by reading in a GFF of 'repeat_region's and a fasta file of
>>> sequences. It outputs a fasta sequence file with the repeats replaced
>>> by Xs.
>>> 
>>> The script clearly needs to be more configurable, but I thought I'd
>>> send it along now to see if I'm working along the right lines, or if I
>>> should be using a different approach.
>>> 
>>> Comments?
>>> 
>>> 
>>> Cheers,
>>> Dan.
>>> 
>>> 
>>> 
>>> #!/usr/bin/perl -w
>>> 
>>> use strict;
>>> use Getopt::Long;
>>> 
>>> use Bio::SeqIO;
>>> use Bio::FeatureIO;
>>> 
>>> 
>>> 
>>> ## Set options
>>> 
>>> my $verbose = 0;
>>> my $seq_file;
>>> my $gff_file;
>>> 
>>> GetOptions
>>> (
>>> "verbose" => \$verbose,
>>> "seq=s" => \$seq_file,
>>> "gff=s" => \$gff_file,
>>> )
>>> or die "failed to parse command line options\n";
>>> 
>>> die "fail $gff_file : $!\n"
>>> unless -s $gff_file;
>>> 
>>> 
>>> 
>>> ## Set up the BioPerl objects
>>> 
>>> my $seq_reader =
>>> Bio::SeqIO->new( -file => $seq_file,
>>>                -format => 'fasta'
>>>              );
>>> 
>>> my $seq_writer =
>>> Bio::SeqIO->new( -fh => \*STDOUT,
>>>                -format => 'fasta',
>>>                -width => 80
>>>              );
>>> 
>>> my $gff_reader =
>>> Bio::FeatureIO->new( -file => $gff_file,
>>>                    -format => 'GFF',
>>>                  );
>>> 
>>> #warn $seq_reader->width, "\n"; exit;
>>> 
>>> 
>>> 
>>> ## Run
>>> 
>>> my (%repeats, $c);
>>> 
>>> while ( my $feature = $gff_reader->next_feature() ) {
>>> if($verbose>1){
>>>  print
>>>    join("\t", #$feature,
>>>        $feature->seq_id,
>>>        $feature->type->name,
>>>        $feature->start,
>>>        $feature->end,
>>>       ), "\n";
>>> }
>>> 
>>> if($feature->type->name eq 'repeat_region'){
>>>  $c++;
>>>  push @{$repeats{ $feature->seq_id }},
>>>    [$feature->start,
>>>     $feature->end];
>>> }
>>> 
>>> # Debugging
>>> #last if $c > 100;
>>> }
>>> 
>>> warn "read $c repeat_region features for ",
>>> scalar keys(%repeats), " sequences\n";
>>> 
>>> 
>>> 
>>> ##
>>> 
>>> while(my $seq = $seq_reader->next_seq){
>>> my $id = $seq->id;
>>> my $sequence = $seq->seq;
>>> 
>>> print $id, "\n"
>>>  if $verbose > 0;
>>> 
>>> print length($sequence), "\n"
>>>  if $verbose > 0;
>>> 
>>> for my $region (@{$repeats{ $id }}){
>>>  my ($start, $end) = @$region;
>>>  print "$start\t$end\n"
>>>    if $verbose > 1;
>>> 
>>>  substr($sequence, $start, $end - $start, 'X' x ($end - $start));
>>> }
>>> 
>>> print length($sequence), "\n"
>>>  if $verbose > 0;
>>> 
>>> $seq->seq($sequence);
>>> 
>>> $seq_writer->write_seq($seq);
>>> 
>>> # Debugging;
>>> #last;
>>> }
>>> 
>>> warn "OK\n";
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
>>