[Bioperl-l] Indexing CDS file

Dave Messina David.Messina at sbc.su.se
Tue Feb 10 12:37:32 UTC 2009


Hi Sviya,

You've almost got it. You need to use Bio::Index::EMBL if you're going to be
indexing EMBL-format files.

I modified your code to do so, and it worked for me (attached below). Please
try it and report back if you have any problems.

One thing: note that there aren't any 'AC' lines in that file you're
indexing, so I saw errors like this:

--------------------- WARNING ---------------------
MSG: For id [EAL24318;] in embl flat file, got no accession number. Storing
id index anyway
---------------------------------------------------


I did see accession-like identifiers on the 'PA' lines, however. Does
anybody know if this is a change in EMBL format that we need to adapt
B::I::EMBL to?


Dave




----------------------------CODE BEGINS-----------------------------
#!/usr/bin/perl -w

use strict;

use Bio::SeqIO;
use Bio::Index::EMBL;

my $file     = "cds100.dat";
my $idx_file = "cds100.idx";

unlink($idx_file); # delete the file, if exists

my $in_seq = Bio::SeqIO->new(   -file => "< $file",
                                -format => "embl");

while ( my $seq = $in_seq->next_seq) {
    print $seq->id, "\n";
}

my $inx = Bio::Index::EMBL->new(-filename => $idx_file,
                                 -write_flag => 'WRITE');

my $retval = $inx->make_index($file,); # creates the index file

print "Done indexing!\n";

my $out = Bio::SeqIO->new( '-format' => 'fasta',
                           '-fh'     => \*STDOUT);


my $query_seq = $inx->fetch('EAL24309;');

if ( ! defined($query_seq) ) {
    print "Sequence not found!\n";
    exit 1;
} else {
    $out->write_seq($query_seq);
}

exit 0;
-------------------------CODE ENDS--------------------------------



More information about the Bioperl-l mailing list