[Bioperl-l] Finding locations of a string within a fasta file
    Chris Fields 
    cjfields at uiuc.edu
       
    Sat Jul 15 17:22:15 EDT 2006
    
    
  
You can retrieve the original GenBank CONTIG file using Bio::DB::GenBank if
the format is set to 'gb' (it is now set to 'gbwithparts' by default).  The
CONTIG lines are currently stored in a series of
Bio::Annotation::SimpleValue objects; the following script gets the
accessions and their spans from them.
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;
my $factory = Bio::DB::GenBank->new(-format => 'gb');
my $seq = $factory->get_Seq_by_id(shift);
my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
                             -format => 'genbank');
# greps only annotations with CONTIG tagname, joins all together
my $contig = join '', map {$_->value} grep {$_->tagname eq 'CONTIG'}
$seq->annotation->get_Annotations();
# split each region, getting rid of gaps and join(), then split into
# acc/span
$contig =~ s{join\(}{}; # drop the wrapper so the first accession is kept
for (grep {$_ !~ m{gap}}
     split ',', $contig) {
    my ($acc, $span) = split ':', $_;
    $span =~ s{\)}{}g; # spurious ')'
    print "ACC: $acc\n\tSpan:$span\n";
}
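
If you only have the scaffold sequence itself (no CONTIG record), a plain
regex scan for runs of N's is another option.  The following is a minimal
sketch, not a BioPerl method: find_n_runs is a hypothetical helper name, the
toy sequence is made up, and coordinates are assumed to be 1-based and
inclusive.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): returns [start, end] pairs,
# 1-based and inclusive, for every run of N or n in the sequence string.
sub find_n_runs {
    my ($seq) = @_;
    my @runs;
    while ($seq =~ /(N+)/gi) {
        # $-[1] is the 0-based offset where the run starts,
        # $+[1] is the offset just past its last character
        push @runs, [ $-[1] + 1, $+[1] ];
    }
    return @runs;
}

# made-up toy sequence, for illustration only
my $toy = 'ACGTNNNNACGTNNAC';
for my $run (find_n_runs($toy)) {
    printf "gap: %d..%d (%d bp)\n", @$run, $run->[1] - $run->[0] + 1;
}
# prints:
# gap: 5..8 (4 bp)
# gap: 13..14 (2 bp)
```

For a real fasta file you would pass in the string returned by $seq->seq
for each sequence read through Bio::SeqIO with -format => 'fasta'.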
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Charles Hauser
> Sent: Saturday, July 15, 2006 2:30 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Finding locations of a string within a fasta file
> 
> All,
> 
> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
> 
> Suggestions on how to easily retrieve this would be appreciated.
> 
> ch