[Bioperl-l] ncDNA

Wed Mar 8 18:56:30 EST 2006

You know, it works on smaller sequences, but I found a problem with some
genome sequences which I didn't notice before (particularly M.
tuberculosis).  So use this one instead.  The problem lies with the fact
that any CDS that don't have a 'gene' tag it will be overlooked.  I think
using locus_tag is better to catch all genes, and if gene is present switch
it out.  Some genome seqfeatures in GenBank files have gene = locus_tag and
some don't.  Try this version, which pulls out everything and if 'gene' is
present will use it instead of the less-useful 'locus_tag':
------------------------------------------
use Bio::SeqIO;
use Bio::Seq;
use strict;

my $starttime = time;
my $file = shift @ARGV;

my $seqio = Bio::SeqIO->new(-file => "$file",
                            -format => 'genbank');
my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
                             -format => 'fasta');
my @feat = {};                                             
my $seq = $seqio->next_seq;

# grab features, filter for those we want to keep
for my $feature ($seq->get_SeqFeatures) {                
   next unless $feature->primary_tag =~ /CDS|rRNA|tRNA/; 
   for my $value ($feature->get_tag_values('locus_tag')) {
      push @feat, {'gene' => $feature->get_tag_values('gene')
                             ? $feature->get_tag_values('gene')
                             : $value,
                 'start' => $feature->start,             
                 'end' => $feature->end};                
   }
}

# extract and format subseqs
my $count = 0;
while($feat[$count]) {
   if (defined $feat[$count]->{'gene'}) {
      # get end of previous gene; if no gene (undef), set to 1 (startpoint)
      my $start = $feat[$count-1]->{'end'} ? $feat[$count-1]->{'end'} + 1 :
1 ;
      # get start of second gene (endpoint)
      my $end = $feat[$count]->{'start'} - 1; 
      my $igrname =  $feat[$count-1]->{'gene'}
                     ?
$feat[$count-1]->{'gene'}."-".$feat[$count]->{'gene'}.' | '.
                        $start."..".$end
                     : $feat[$count]->{'gene'}.'|'.
                        $start."..".$end;
      my $igr = $end - $start;
      if($igr > 0 ) {  # filter out negative or zero values
         my $sub = $seq->subseq($start, $end);
         my $subseq = Bio::Seq->new(-seq   => $sub,
                                    -desc  => $igrname);
         $seqout->write_seq($subseq);
      }
   }
   $count++;
}
------------------------------------------

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Chris Fields
> Sent: Wednesday, March 08, 2006 4:48 PM
> To: perlmails at gmail.com; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] ncDNA
> 
> It's a good idea to mail to the list as well; others may have better or
> alternative ways of doing this.
> 
> 
> 
> Jason mentions masking all regions you don't want (CDS, rRNA, tRNA) by
> changing them to N's, then splitting on multiple N's:
> 
> 
> 
> http://portal.open-bio.org/pipermail/bioperl-l/2005-January/018092.html
> 
> 
> 
> Using substr (as Jason suggests) should also speed things up.
> 
> 
> 
> Here's my solution using only bioperl objects; I use this for some work I
> do
> here.  It works but isn't terribly fast with large sequences due to object
> creation. There is very likely a better way with larger sequences.  You
> could probably use the first loop to extract seqfeatures and change the
> second part to simple perl to speed things up a bit or try straight regex
> matching combined with substr for maximum velocity.  Be careful though;
> bioperl subseq() searches starting with 1 while I think perl substr starts
> at 0!