[Bioperl-l] Finding locations of a string within a fasta file

Sat Jul 15 17:22:15 EDT 2006

You can retrieve the original GenBank CONTIG file using Bio::DB::GenBank if
the format is set to 'gb' (it is now set to 'gbwithparts' by default.  The
CONTIG lines are currently stored in a series of
Bio::Annotation::SimpleValue objects; get the accessions using the following
script.  

use strict;
use warnings;

use Bio::DB::GenBank;

my $factory = Bio::DB::GenBank->new(-format => 'gb');

my $seq = $factory->get_Seq_by_id(shift);

my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
                             -format => 'genbank');

# greps only annotations with CONTIG tagname, joins all together
my $contig = join '', grep {$_->tagname eq 'CONTIG'}
$seq->get_Annotations();

# split each region, getting rid of gaps and join(), then split into
acc/span
for (grep {$_ !~ m{gap|join}}
     split ',', $contig) {
    my ($acc, $span) = split ':', $_;
    $span =~ s{\)}{}g; # spurious ')'
    print "ACC: $acc\n\tSpan:$span\n";
}

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Charles Hauser
> Sent: Saturday, July 15, 2006 2:30 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Finding locations of a string within a fasta file
> 
> All,
> 
> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
> 
> Suggestions on how to easily retrieve this would be appreciated.
> 
> ch
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l